Greener Big Data: Optimizing Data Exchange and Power Procurement for Data Centers
by
Zakia Asad
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2017 by Zakia Asad
Abstract
Greener Big Data: Optimizing Data Exchange and Power Procurement for Data
Centers
Zakia Asad
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2017
Big data is revolutionizing modern day life at an environmental cost due to the “un-
green” practices associated with the operation of data centers, a workhorse for big data
processing. Turning the big data enterprise into a greener enterprise requires two funda-
mental steps. The first step is curtailing resource consumption, and making data centers
energy efficient. The second step is using greener energy sources to power data centers.
Big data processing has challenged the ability of data centers to operate in a greener
fashion. In particular, the rise of the cloud and distributed data-intensive (“big data”)
applications puts pressure on data center networks due to the movement of massive
volumes of data. Reducing the volume of communication is crucial for embracing greener
data exchange through efficient utilization of resources. This thesis proposes the use of a
mixing technique, spate coding, as a means of dynamically-controlled reduction in volume
of communication. We introduce real-world use-cases, and present a novel spate coding
algorithm for data center networks. We also analyze the computational complexity
of minimizing the volume of communication in a distributed data center application, and
provide theoretical limits. Moreover, we proceed to bridge the gap between theory and
practice by performing a proof-of-concept implementation of the proposed system in a
real-world data center. We use MapReduce, the most widely used Big Data processing
framework, as our target. Experimental results employing industry standard benchmarks
show the advantage of our proposed system compared to the current state-of-the-art.
Another important step in promoting green practices for the big data enterprise is
reshaping the energy mix required to operate data centers. In this context, we con-
sider power procurement for data centers while incorporating the green initiatives and
cyber-physical constraints imposed by evolving power systems. Moreover, we consider
implications of our formulation to solution complexity. We use the concept of matroid
theory to model the problem as a combinatorial problem and propose a distributed and
optimal solution. We account for communications and computational complexity and
provide solutions for practical scenarios.
In the end, we propose a futuristic cross-plane green orchestrator.
Dedicated to my love, Asad
Acknowledgements
I owe all to my God, the Lord of the heavens and the earth, the compassionate and the
merciful. I thank Him for opening up my heart and mind to the wonders of knowledge.
It is with immense gratitude that I acknowledge the valuable support, availability, and
help of Prof. Frank Kschischang. Long discussions with him have been very enriching.
His intriguing questions provided me with an opportunity to better organize my thoughts
as well as my writing. I would like to credit Prof. Cristiana Amza for being very kind, supportive,
and encouraging throughout this process. I am very thankful to Prof. Baochun Li for
his help, guidance, and support in reviewing my work and providing me with very
helpful early feedback. I am also very thankful to Prof. Ben Liang, whose questions
helped me to better position the work. His suggestions and comments
refined various aspects of my work. I also want to acknowledge Prof. David Lie and Ms.
Darlene Gorzo for their valuable assistance when it was most needed. Special thanks to
Schlumberger Foundation, and Soptimizer for their support.
I am truly indebted to my parents, and in-laws for their continuous support and
prayers that kept me in high spirits throughout this journey. I owe my deepest gratitude
to my father-in-law who has always been there to provide unconditional support by being
my coach, mentor, counselor, and doctor. I also want to thank my father who envisioned
this path for me and helped me to grow and become what I am today. He has always been
a source of confidence and inspiration for me. I can only imagine the happiness and joy my
late mother would have felt on successful completion of my PhD. Her friendship is always
remembered, and has always been a source of resilience and strength. My mother-in-law,
and my sisters-in-law have always been very supportive of my goals and I thank them
from the core of my heart. I cannot express in words the support that I receive from my
wonderful kids Abd-Ur-Rehman, Marium, and Abd-Ur-Rahim for always understanding
my workload and assisting me in their own little ways. Their involvement and interest in
my work have always sparked extra endurance in my inner-self and brightened my life.
My kids are among the strongest reasons I have come so far.
Last but certainly not least, my PhD would have remained nothing more than a dream
had Asad not been there. His unreserved friendship, definite support, and outstanding love have shaped
my career. He has not only been my buddy but also my collaborator. He helped me to
carry out my research by transforming our home into a small research center. He has
been a beacon of happiness and source of inspiration for me. His everlasting commitment
to the family has bound us together as a strong and happy unit. He has always been
there to accompany the kids during my work, and cover for my role—as a mother—so
they are never left unloved even for a moment. His compassion and encouragement have
been a great gift from Allah. I cannot thank Allah, the Almighty, enough for having such
a perfect life partner, and a wonderful family that kept me exuberant no matter what.
Love, support and joy—every bit to enjoy,
Unconditional support to keep me buoy,
Indebted to Asad for helping me to fly,
And reach my dreams high up in the sky
Contents
1 Introduction 1
1.1 Energy Insights of the Ungreen Big Data Enterprise . . . . . . . . . . . 3
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Dissecting Big Data Enterprise . . . . . . . . . . . . . . . . . . . 5
1.2.2 A Coding Based Optimization for Big Data Processing . . . . . . 5
1.2.3 Distributed Power Procurement . . . . . . . . . . . . . . . . . . . 6
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Preliminaries 9
2.1 Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Matroid Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Matroid Intersection . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Hadoop MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Combiner and In-network Combiner . . . . . . . . . . . . . . . . . 18
3 Five Vital Planes for Green Big Data Enterprise 21
3.1 Server Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Virtualization Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Interconnect Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Frameworks Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Power Procurement Plane . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 Connecting the Dots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4 A Coding Based Optimization for Big Data Processing 26
4.1 Facing the Challenge: High Volume of Communication and Big Data Pro-
cessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Motivating Use-Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4.1 Hadoop MapReduce for Electricity Theft Detection . . . . . . . . 34
4.4.1.1 Hadoop Job . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.1.2 Standard Mechanism . . . . . . . . . . . . . . . . . . . . 38
4.4.1.3 Proposed Coding Based Mechanism . . . . . . . . . . . . 39
4.4.2 Storm for DNA Sequencing . . . . . . . . . . . . . . . . . . . . . 41
4.4.2.1 Storm Job . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4.2.2 Standard Mechanism . . . . . . . . . . . . . . . . . . . . 44
4.4.2.3 Proposed Coding Based Mechanism . . . . . . . . . . . . 45
4.5 Spate Coding Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5.2 Spate Coding and its Relationship with Network Coding and Index
Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5.2.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.3 Solution for Spate Coding Problem . . . . . . . . . . . . . . . . . 51
4.6 Challenges and Considerations . . . . . . . . . . . . . . . . . . . . . . . . 55
4.7 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.8 Theory to Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.8.1 Coding Based Middlebox and its Components . . . . . . . . . . . 58
4.8.1.1 Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.8.1.2 Coder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.8.2 PreReducer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.8.3 Seamless Integration using OpenFlow . . . . . . . . . . . . . . . . 68
4.9 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.9.1 Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.9.2 Testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.9.2.1 Volume-of-Communication . . . . . . . . . . . . . . . . . 80
4.9.2.2 Goodput . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.9.2.3 Disk IO Statistics . . . . . . . . . . . . . . . . . . . . . . 84
4.9.2.4 Bits-per-Joule . . . . . . . . . . . . . . . . . . . . . . . . 85
4.10 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.11 Connecting the Dots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5 Distributed Power Procurement for Data Centers 91
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.1.1 Example Case Study: Power Procurement over a 12 bus power
system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4.1 Why Matroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.5 Restricted Constrained Power Procurement Problem . . . . . . . . . . . 107
5.5.1 Relation to Matroids . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5.1.1 Policy Constraints . . . . . . . . . . . . . . . . . . . . . 108
5.5.1.2 Transmission Line Constraints . . . . . . . . . . . . . . . 109
5.5.2 A Distributed Algorithm for the 1-RCPP Problem . . . . . . . . . 110
5.5.3 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.5.4 T-Restricted Constrained Power Procurement Problem . . . . . . 116
5.5.4.1 Extended Ground Set . . . . . . . . . . . . . . . . . . . 116
5.5.4.2 Policy Constraints . . . . . . . . . . . . . . . . . . . . . 116
5.5.4.3 Transmission Line Constraints . . . . . . . . . . . . . . . 117
5.6 Constrained Power Procurement Problem . . . . . . . . . . . . . . . . . . 119
5.7 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.7.1 An overview of relevant solution techniques . . . . . . . . . . . . . 122
5.7.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.7.3 Percentage procurement and Cost . . . . . . . . . . . . . . . . . . 124
5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6 Concluding Remarks and Path Forward 132
6.1 Open Issues and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 133
6.2 Green Orchestrator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Appendices 141
A 5Ps: Literature Survey and Discussions 142
A.1 Server Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
A.2 Virtualization Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
A.3 Interconnect Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
A.4 Frameworks Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
A.5 Power Procurement Plane . . . . . . . . . . . . . . . . . . . . . . . . . . 147
B Proof: A Coding Based Optimization for Big Data Processing 151
B.1 Proof of Theorem 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
B.2 Proof of Theorem 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
B.3 Proof of Theorem 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
B.4 Proof of Theorem 24 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B.5 Proof of Theorem 26 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B.6 Proof of Theorem 28 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B.7 Proof of Theorem 29 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
C Proof: Distributed Power Procurement for Data Centers 158
C.1 Proof of Lemma 33 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
C.2 Proof of Theorem 34 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Bibliography 160
List of Figures
1.1 Big data: greener optimization. . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 A graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Graph H for iterations 1,2, and 3 during execution of the matroid intersec-
tion algorithm for finding least cost common independent set of maximum
cardinality of two matroids M1 and M2. . . . . . . . . . . . . . . . . . . 16
2.3 A toy word count example demonstrating basic tasks of Hadoop Mapreduce. 18
2.4 A toy word count example demonstrating the use of combiner. . . . . . . 19
2.5 A toy word count example demonstrating the use of in-network combiner. 20
3.1 6Ps of the big data enterprise. . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1 Topology and mapper/reducer output/input in the electricity theft detec-
tion example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Twelve records emitted by spout for the DNA-sequence processing job. . 42
4.3 Storm topology for DNA sequencing example. . . . . . . . . . . . . . . . 43
4.4 Placement of tasks of bolt A and bolt B in the DNA sequencing example
on a four node Storm cluster. . . . . . . . . . . . . . . . . . . . . . . . . 43
4.5 An instance of spate coding for the example 4.4.1. . . . . . . . . . . . . . 47
4.6 (a) An instance of spate coding problem. (b) A similar instance of index
coding problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.7 Dependency graph for the transmissions crossing AS for the Example given
in 4.4.1. The execution of the Algorithm 1 results in four multicast packets
shown by four cliques each of size 2. . . . . . . . . . . . . . . . . . . . . 53
4.8 The proposed coding based Hadoop MapReduce data flow. . . . . . . . . 59
4.9 Architecture for middlebox (sampler, and coder). . . . . . . . . . . . 60
4.10 Architecture for preReducer. . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.11 Coding for variable packet lengths. . . . . . . . . . . . . . . . . . . . . . 65
4.12 The proposed packet payload format for the coding based shuffle at the
application-layer. Depending on size, the format shown may be trans-
ported using multiple segments/packets. . . . . . . . . . . . . . . . . . . 67
4.13 OpenFlow based multicasting using coding groups coordinator for the ex-
ample given in section 4.4.1. Flow Tables and the corresponding Group
Tables for each switch in the multicast are also shown. . . . . . . . . . . 71
4.14 A word count example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.15 Prototype setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.16 Testbed architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.17 Topologies used for testbed: (a) Top-1 (b) Top-2. . . . . . . . . . . . . . 79
4.18 Normalized VoC using Grep benchmark for both topologies Top-1 and
Top-2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.19 Normalized VoC using Terasort benchmark for both topologies Top-1 and
Top-2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.20 Normalized VoC for different values of λ (maximum clique size) used in
Algorithm 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.21 Goodput versus link rates for sorting benchmark for topology Top-1. . . 84
4.22 Goodput for different oversubscription ratios using sorting benchmark for
topology Top-1 with link rate at 500 Mbps. . . . . . . . . . . . . . . . . . 85
4.23 Goodput for different oversubscription ratios using sorting benchmark for
topology Top-1 with link rate at 1000 Mbps. . . . . . . . . . . . . . . . . 86
4.24 %d-util versus link rates using sorting benchmark for topology Top-1. . . 86
4.25 Qsz versus link rates using sorting benchmark for topology Top-1. . . . . 87
4.26 BpJ versus link rates using sorting benchmark for topology Top-1. . . . . 88
5.1 Three data center facilities A1, A2, A3 and a 12 bus power system with
eight generating sources g1, g2, . . . , g8. The constrained transmission lines,
γ1 and γ2, are shown in bold. . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 An equivalent representation of the imposed constraints for the 12 bus
case study shown in Figure 5.1. . . . . . . . . . . . . . . . . . . . . . . . 96
5.3 Startup delays and generation cost associated with each generating point
for each time unit t0, t1, t2 and a planning horizon of three time units for
the 12 bus case study shown in Figure 5.1. . . . . . . . . . . . . . . . . 96
5.4 An instance of the CPP problem. . . . . . . . . . . . . . . . . . . . . . . 105
5.5 Algorithm 1-RCPP . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.6 Algorithm b = ChkL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.7 Algorithm T-RCPP . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.8 A sample execution of Algorithm T-RCPP for the system shown in Ex-
ample 5.1.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.9 Average Percentage procurement α for random power networks as a func-
tion of Transmission Network Density TND for PCF = 2. . . . . . . . 125
5.10 Average Percentage procurement α for random power networks as a func-
tion of Transmission Network Density TND for PCF = 4. . . . . . . . 125
5.11 Average Percentage procurement α for random power networks as a func-
tion of Transmission Network Density TND for PCF = 8. . . . . . . . 126
5.12 Average Procurement Cost per Facility β for random power networks as
a function of Transmission Network Density TND for PCF = 2. . . . . 127
5.13 Average Procurement Cost per Facility β for random power networks as
a function of Transmission Network Density TND for PCF = 4. . . . . 127
5.14 Average Procurement Cost per Facility β for random power networks as
a function of Transmission Network Density TND for PCF = 8. . . . . 128
5.15 Average Percentage procurement α for random power networks as a func-
tion of Procurement Choices per Facility PCF for TND = 30. . . . . . 129
5.16 Average Percentage procurement α for random power networks as a func-
tion of Procurement Choices per Facility PCF for TND = 45. . . . . . 129
5.17 Average Procurement Cost per Facility β for random power networks as
a function of Procurement Choices per Facility PCF for TND = 30. . . 130
5.18 Average Procurement Cost per Facility β for random power networks as
a function of Procurement Choices per Facility PCF for TND = 45. . . 130
6.1 Green Orchestrator for big data processing. . . . . . . . . . . . . . . . . . 137
B.1 An instance of the Problem ES. . . . . . . . . . . . . . . . . . . . . . . . 154
Chapter 1
Introduction
The tremendous growth in the data generated by different sources is reshaping modern
day life. In fact, more than 90% of all the data in the world has been generated in the
last four years [1]. Big data is revolutionizing many fields including, but not limited to,
health care, law enforcement, financial markets, manufacturing, town planning, online
profiling, and targeted advertisement. The term big data is usually associated with the
following “five V’s” [2, 3]:
• Volume: The volume of data is predicted to soar as high as 43 trillion gigabytes
by the year 2020, an almost 300-fold increase over 2005, driven by the addition
of 2.3 trillion gigabytes of new data every day.
• Velocity: The rate at which data is generated is growing tremendously. For
instance, more than 400 billion hours of video are watched on YouTube each
month, about 400 million tweets are sent each day, and these rates are expected
to keep increasing.
• Variety: The data is generated by a variety of sources including emails, social
media—e.g., Facebook, Twitter, and YouTube—stock markets, and sensors. This
highly diversified data has left behind the era of well-behaved, rational, and
structured data. Almost 80 percent of today's data does not fit into traditional
databases, and cannot be harnessed by populating highly structured data objects.
• Veracity: It is crucial for the data to be accurate and trustworthy. The applicability
of big data in critical fields—like health care, trade markets, and smart grid, to
name a few—puts even more pressure on its integrity as well as accuracy.
• Value: The pursuit of big data is all about the value churned out of its volume,
velocity, variety, and veracity. Therefore, it is important that the data is not only
collected correctly but also processed wisely to extract valuable information.
The rise of the big data enterprise has opened up several new challenges. To offer
practical benefits, big data solutions require distributed processing and storage across
clusters of servers housed in data centers. In fact, data centers are the backbone of big
data processing.
Big data processing is reported to be very “dirty” in terms of the environmental cost due
to the superfluous energy consumption as well as “ungreen” practices associated with
the operation of data centers [4]. This thesis is focussed on turning data centers into a
greener enterprise. For realizing a greener big data ecosystem, data center efficiency and
power procurement from greener sources should be coupled together [5]. In particular,
turning the big data enterprise into a greener enterprise requires two fundamental steps,
as shown in Figure 1.1. The first step is curtailing resource consumption, and making
data centers energy efficient. The second step is using greener energy sources to power
data centers. Specifically, we explore the following two fundamental approaches as means
towards holistic greener data centers:
• Optimizing big data processing in data centers by curtailing resource consumption
and making the system energy efficient.
• Optimizing the power procurement to power data centers by employing green policy
directives in presence of the system level constraints.
Figure 1.1: Big data: greener optimization.
1.1 Energy Insights of the Ungreen Big Data Enter-
prise
Big data processing units, data centers, have shown a tendency to become energy black
holes. This section highlights energy insights with reference to the “ungreen” big data
enterprise.
Data centers are among the leading electricity consumers in the United States
[6]. A 2008 study showed that around 2% of the world’s total electricity is being consumed
by data centers [4,7]. A more recent report shows that data centers consume around 3%
of the world’s total electricity, and produce 200 million metric tons of carbon dioxide [8,9].
Moreover, electricity consumption by data centers is growing at a rate of 12% per year,
and is about 56% higher than in the preceding five years [10–13].
In Europe, energy consumption by data centers is set to go as high as 93 billion
kWh by 2020, which is equivalent to one hundred million 100-watt light-bulbs burning
24 hours a day, 365 days a year [14]. Similarly, in 2013 the power consumption by U.S.
data centers was around 91 billion kWh, which is equivalent to deriving energy from
thirty-four 500-MW coal-fired power plants on an annual basis. This power consumption
is estimated to soar as high as 140 billion kWh per year by 2020, resulting in $13 billion
per year in electricity bills and almost 150 million metric tons of carbon emissions [6].
However, soaring electricity bills and staggering carbon emissions have not dispirited the
data center industry from further expansion. More precisely, in the past five years nearly
66% of data center facilities added over 1000 kWh per data center to meet increasing
electricity demands [15].
Moreover, a big challenge is extending current peta-scale computing clusters to exa-
scale computing clusters of the future, where scaling-up current data centers is not an
option due to preposterous power demand. For example, simply scaling up a current
state-of-the-art fastest peta-scale computing platform, like Blue Waters, to an exa-scale
platform will need 1.5 GW of power for each such platform, which is almost 0.1% of
the total U.S. generation capacity required per platform, thus implying that each exa-
scale platform will require a reasonably sized nuclear power plant just to fulfill its energy
demands [16]!
Until 2011, more than half of the massive-scale data centers procured up to 80% of
their power from coal [4]. Industry giants like Apple, IBM, and Facebook were reportedly
powering their data centers with more than 50% coal-generated electricity. However,
in more recent years, the industry has faced growing pres-
sure to integrate renewable energy sources mainly due to environmental responsibility
campaigns as well as federal legislation (at least in the U.S. [17]) to limit carbon emis-
sions. This has had very favorable impact; for instance, Apple, Box, Facebook, Google,
Rackspace, and Salesforce have committed to incorporate 100% renewable energy sources
to power their data centers [18]. However, many companies still opt for the economical
rather than the greenest energy from the grid, e.g., Amazon Web Services, YouTube,
and Netflix [19, 20].
Big data frameworks like Hadoop MapReduce [21], Amazon Elastic MapReduce [22],
and Google MapReduce [23] are the prominent growing segment of data centers’ work-
load, used primarily for harnessing the power of analytics [24]. Server clusters running
these frameworks consume a lot of energy. Moreover, these frameworks inherently induce
low utilization of the data center resources. As an example, it has been reported that a
200-node Hadoop cluster draws around 40 kW of power for a certain workload [25].
Moreover, in a multi-job Hadoop workload, around 38% of the time CPUs, disk, and
network remain inactive [24], thus resulting in unnecessary power drainage.
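To put these figures in perspective, a rough back-of-envelope estimate can be made from the numbers above. The 40 kW draw and the 38% idle fraction are the reported figures; continuous year-round operation at full draw is a simplifying assumption, since idle servers in practice draw somewhat less than peak:

```python
# Rough estimate of the energy drained while cluster resources sit idle.
# Figures: reported 40 kW draw of a 200-node Hadoop cluster, reported 38%
# idle time; year-round operation at full draw is a simplifying assumption.
cluster_power_kw = 40.0
idle_fraction = 0.38
hours_per_year = 24 * 365  # 8760 hours

idle_energy_kwh = cluster_power_kw * idle_fraction * hours_per_year
print(round(idle_energy_kwh))  # ≈ 133152 kWh drained per year per cluster
```

Even under these crude assumptions, a single mid-sized cluster drains on the order of 10^5 kWh per year during idle periods, which motivates the resource-curtailment step pursued in this thesis.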
1.2 Contributions
This thesis revolves around providing greener big data, and gives a comprehensive view
of several “energy-wise” planes and solutions to remediate the resource-hungry big data
enterprise. In particular, we make the following contributions.
1.2.1 Dissecting Big Data Enterprise
Turning big data enterprise into a greener enterprise requires two fundamental steps.
The first is curtailing resources for making the system resource efficient. The second is
using greener energy sources to power it up.
Accordingly, we identify five planes (5Ps) impacting the green footprints of data
centers, namely, server plane, virtualization plane, interconnect plane, framework plane,
and power procurement plane.
1.2.2 A Coding Based Optimization for Big Data Processing
It has been established that the reduction in volume of communication directly corre-
sponds to energy savings in a data center [26–30]. Hence, improving the energy efficiency
by reducing volume of communication is fundamental in bringing green practices to big
data processing.
We propose a system for optimizing big data processing in a data center. Particularly,
the focus is to make the big data processing greener by reducing the volume of com-
munication and help making server, virtualization, interconnect, and framework planes
greener. We motivate our work by presenting real world use-cases. Our contributions are
both theoretically and practically significant.
Our major theoretical contributions include analyzing the computational complexity
of a general scheme that tries to minimize volume of communication in a distributed
data center application without degrading the rate of information exchange. In addi-
tion to this, we present theoretical limits of such schemes by providing the upper and
lower bounds. Moreover, we prove that, in contrast to a frequent practice in many data
centers, network bisection is not an optimal location for middlebox placement for some
applications. Furthermore, we formulate a spate coding problem which helps in reducing
volume of communication in big data applications. We also present an efficient solution
for spate coding.
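The coding gain that such schemes exploit can be illustrated with a toy two-receiver exchange. This is a generic index-coding-style sketch for intuition only, not the spate coding algorithm developed in Chapter 4, and the packet contents are made up: when each of two receivers already holds the packet the other one needs, a single multicast of the bitwise XOR of the two packets replaces two separate unicasts.

```python
# Toy coding gain: one multicast of a XOR b serves two receivers that
# each hold, as side information, the packet the other receiver wants.
# Packet contents are illustrative; equal packet lengths are assumed.
a = b"SMART-METER-DATA"   # wanted by receiver 1, already held by receiver 2
b = b"BILLING-RECORDSX"   # wanted by receiver 2, already held by receiver 1

def xor(x, y):
    """Bytewise XOR of two equal-length packets."""
    return bytes(p ^ q for p, q in zip(x, y))

coded = xor(a, b)          # the single coded transmission

# Each receiver decodes by XORing the coded packet with its side information.
assert xor(coded, b) == a  # receiver 1 recovers a
assert xor(coded, a) == b  # receiver 2 recovers b
# Volume of communication: 1 coded packet instead of 2 uncoded packets.
```

The same XOR trick generalizes to larger coding groups: each clique of mutually decodable transmissions can be collapsed into one multicast packet, which is where clique-based constructions come in.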
Our major practical contributions include a proof-of-concept implementation in a
data center, and a testbed. We have used Hadoop, the most widely used big data
processing framework, as our target framework. We have used two industry-standard
benchmarks—Terasort and Grep—to evaluate the performance of the proposed system.
The experimental results exhibit coding advantage, for different parameters, compared
to both a vanilla Hadoop implementation and an in-network combiner. In addition, we
have combined our scheme with in-network combiner, referred to as Combine-N-Code, to
further reduce the network traffic.
1.2.3 Distributed Power Procurement
The increasing trend of geographically dispersed data center facilities as well as the rise
of geographically diverse energy pricing over the traditionally shared power networks
is marking a new era of complex power-procurement paradigms. Adding to the com-
plexity are the rising demands for inclusion of policy-related constraints for the power
procurement of “power-hungry” data center enterprises.
This thesis presents a model to capture the emerging constraints for the power pro-
curement problem. We have used matroid theory to gain a better insight into the com-
binatorics associated with the problem. Specifically, we consider power procurement in
the presence of both the policy-driven constraints, set forward by data center operators,
as well as the physical constraints imposed by the evolving smart grid.
We have developed a framework for constrained power procurement (CPP), in which
the problem is broken down into building blocks, effective in relating power procurement
complexity to policy-driven as well as physical system constraints. Specifically, we
introduce the 1-restricted constrained power procurement (1-RCPP) and the T-restricted
constrained power procurement (T-RCPP) problems. Moreover, we present distributed polynomial
time solutions to the 1-RCPP and T-RCPP problems. Our distributed solutions provide
a novel way to assign IDs to the generating sources to facilitate the distributed power
procurement for data center facilities. Furthermore, we show that for less restrictive
physical constraints, the CPP problem is NP-hard to approximate within a ratio of n^{1−ε}
for any constant ε > 0.
1.3 Thesis Outline
This thesis is organized as follows. Chapter 2 provides some preliminaries. In Chapter
3, we dissect the big data enterprise into five planes vital for turning the big data
enterprise greener. Moreover, we provide a detailed overview of the related efforts by academia and
industry. In Chapter 4, we present a novel spate-coding-based scheme that reduces the
volume of communication, and thereby helps make data centers greener. This chapter
also provides a theoretical analysis of schemes that reduce the volume of communication,
and presents a detailed framework for Hadoop MapReduce. We present detailed insights
covering several performance aspects, using a data center setup as well as a testbed
setup. Chapter 5 models the power procurement problem under green-policy-driven as
well as emerging constraints imposed by distributed generation. We provide an equivalent
matroid representation of the constraints, and provide a distributed solution. To
conclude, we present open issues and associated challenges in Chapter 6. This chapter
also synthesizes a layout for a futuristic cross-plane green orchestrator.
Chapter 2
Preliminaries
This chapter introduces some of the tools used later in the thesis. Section 2.1 provides
an overview of graph theory and related definitions. Section 2.2 describes basics related
to matroid theory. Section 2.3 presents an overview of the MapReduce framework in
context of Apache Hadoop.
2.1 Graph Theory
Graph theory is a branch of mathematics concerned with the study of graphs. Graphs
are mathematical structures used to model pairwise relations between objects. The
origin of graph theory can be traced back to 1736, when Leonhard Euler solved the
Königsberg bridge problem. Although graph theory has its origin in recreational math,
it has found applications in many disciplines, including but not limited to, computer
science, network theory, chemistry, social sciences, and operations research [31–33].
A formal definition of a graph is given below.
Definition 1 (Graph) An undirected graph G(V,E) is a non-empty finite set V of ele-
ments called vertices, and a finite set E of unordered pairs of elements (u, v) in V called
edges. If the set E consists of ordered pairs, then the graph is a directed graph.
Figure 2.1: A graph.
We will only consider graphs with a finite number of vertices.
Two vertices u and v in G(V,E) are called adjacent if there is an edge (u, v) ∈ E.
Moreover, a weighted graph is a graph for which each edge has an associated weight,
typically given by a weight function w : E → R [34].
Definition 2 (Bipartite Graph) A bipartite graph G(V,E) is a graph in which V can
be partitioned into two nonempty disjoint subsets V1 and V2, such that ∀ (u, v) ∈ E, u ∈ V1
and v ∈ V2.
Next, we introduce the concept of a clique in a graph and the clique packing problem.
Definition 3 (Clique) Given a graph G(V,E), a clique ϑ is a nonempty set of vertices
ϑ ⊆ V such that ∀ u, v ∈ ϑ, u ≠ v, (u, v) ∈ E.
A clique ϑ is called a k-clique, denoted by ϑk, if |ϑ| = k, where |ϑ| denotes the number
of elements in ϑ.
For example, {a, b, c, d} is a 4-clique (ϑ4) in the graph of Figure 2.1. Any single vertex
is a 1-clique.
The following problem will be important later in this thesis.
Definition 4 (Clique Partition Problem) Given a graph G(V,E), partition V into
the minimum number of disjoint subsets ϑ1, ϑ2, . . . , ϑk, such that each ϑi, 1 ≤ i ≤ k, is a
clique.
Clique Partition is an NP-hard problem [35]. One possible clique partition for the
graph shown in Figure 2.1 results in two cliques ϑ1 = {a, b, c} and ϑ2 = {d, e}. Another
possibility is ϑ1 = {a, b, c, d} and ϑ2 = {e}. Clearly there can be more than one way to
partition a graph into the minimum number of cliques.
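Since Clique Partition is NP-hard, exact solutions are impractical for large graphs, and greedy heuristics are commonly used instead. The sketch below is an illustrative heuristic (not an algorithm from this thesis), run on an adjacency structure consistent with the description of Figure 2.1 (the 4-clique {a, b, c, d} plus the edge (d, e)); the exact edge list is an assumption.

```python
def greedy_clique_partition(vertices, adj):
    """Greedily partition `vertices` into cliques.

    `adj` maps each vertex to the set of its neighbours. The heuristic
    returns *a* clique partition, not necessarily a minimum one.
    """
    remaining = list(vertices)
    cliques = []
    while remaining:
        clique = [remaining.pop(0)]              # seed a new clique
        for v in remaining[:]:
            # v joins only if it is adjacent to every current member
            if all(v in adj[u] for u in clique):
                clique.append(v)
                remaining.remove(v)
        cliques.append(set(clique))
    return cliques

# Adjacency consistent with Figure 2.1 (assumed for illustration).
adj = {
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c', 'd'},
    'c': {'a', 'b', 'd'},
    'd': {'a', 'b', 'c', 'e'},
    'e': {'d'},
}
cliques = greedy_clique_partition('abcde', adj)
print([sorted(c) for c in cliques])  # [['a', 'b', 'c', 'd'], ['e']]
```

On this instance the greedy pass happens to find a minimum partition into two cliques; in general it may use more cliques than necessary.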
Next, we define a path, and a shortest path in a graph.
Definition 5 (Path) A path P between vertices u1 and ux in G(V,E) is a subgraph
P(V′, E′) such that
V′ = {u1, u2, · · · , ux},
and
E′ = {(u1, u2), (u2, u3), · · · , (ux−1, ux)},
where the ui are all distinct. The length of path P is |E′|. The weight of path P is
Σ_{(u,v)∈E′} w(u, v).
Definition 6 (Shortest Path) A shortest path between vertices u and v is a path of
weight W such that no other path between u and v has weight smaller than W .
For the case where w(u, v) = 1 ∀ (u, v) ∈ E for the graph shown in Figure 2.1, a
shortest path between the vertices a and e is P = (a, d, e), of weight 2.
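Shortest paths in weighted graphs can be computed with Dijkstra's algorithm. The sketch below runs it on a unit-weight edge set consistent with the description of Figure 2.1 (the 4-clique {a, b, c, d} plus the edge (d, e)); the exact edge list is an assumption made for illustration.

```python
import heapq

def dijkstra(adj, source):
    """Return a dict of shortest-path weights from `source`.

    `adj` maps a vertex to a list of (neighbour, weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue                          # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Unit-weight edges consistent with Figure 2.1 (assumed).
edges = [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd'), ('d', 'e')]
adj = {v: [] for v in 'abcde'}
for u, v in edges:
    adj[u].append((v, 1))
    adj[v].append((u, 1))

print(dijkstra(adj, 'a')['e'])  # 2, attained by the path a, d, e
```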
2.2 Matroid Theory
Matroids were first introduced by Whitney in the early 20th century [36]. Matroid the-
ory has applications in many disciplines including linear algebra, graph theory, network
theory, coding theory, and combinatorial optimization.
Definition 7 (Matroid) A matroid M(X,L) is an ordered pair formed by a ground
set X and a collection L of subsets of X, that satisfy the following three conditions:
1. ∅ ∈ L;
2. If Y ∈ L and Y ′ ⊆ Y , then Y ′ ∈ L (hereditary property);
3. If Y1 ∈ L and Y2 ∈ L with |Y1| > |Y2|, then there exists x ∈ Y1 \ Y2 such that
Y2 ∪ {x} ∈ L (augmentation property).
We will only consider matroids where the ground set is finite.
Each Y ∈ L is referred to as an independent set. A maximal independent subset
of X is referred to as a base. Condition 3 implies that all bases of M have the same
cardinality, referred to as the rank ofM. Elements of X can be associated with a weight
function c : X → R. The weight of a subset Y of X is defined as the sum of the weights
of its elements:
w(Y) = Σ_{x∈Y} c(x).
We next provide the definition of vector matroid and partition matroid [37].
Definition 8 (Vector Matroid) Let E be a set of column labels of an m×n matrix A
over a field F, and let L be the set of subsets S of E for which the multiset of columns
labelled by S is a set and is linearly independent in the vector space Fm. Then (E,L) is
a vector matroid.
For example, consider the binary matrix

A =
[ 1 0 0 0 ]
[ 0 1 1 0 ]
[ 0 0 1 1 ],

in which the four columns are labelled, from left to right, as c1, c2, c3, c4. Then the
set of columns {c1, c2, c3} constitutes a maximal independent set, or a base.
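Independence of a set of columns can be checked by Gaussian elimination over the underlying field. Assuming the binary matrix A above is taken over GF(2), the sketch below encodes each column as an integer bitmask (top row as the most significant bit, an illustrative choice) and incrementally builds a reduced basis.

```python
def gf2_independent(columns):
    """Check linear independence over GF(2) of columns given as bitmasks."""
    basis = {}                                   # highest set bit -> basis vector
    for v in columns:
        while v and (v.bit_length() - 1) in basis:
            v ^= basis[v.bit_length() - 1]       # eliminate the leading bit
        if v == 0:
            return False                         # v lies in the span of the others
        basis[v.bit_length() - 1] = v
    return True

# Columns of A: c1 = (1,0,0), c2 = (0,1,0), c3 = (0,1,1), c4 = (0,0,1).
c1, c2, c3, c4 = 0b100, 0b010, 0b011, 0b001

print(gf2_independent([c1, c2, c3]))      # True: {c1, c2, c3} is a base
print(gf2_independent([c1, c2, c3, c4]))  # False: A has rank 3, so all four columns are dependent
```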
Definition 9 (Partition Matroid) Let (E1, E2, · · · , Er) be a partition π of a set E
into nonempty subsets. Then (E,L) is a partition matroid when L is defined as:
L = {X ⊆ E : |X ∩ Ei| ≤ ki ∀ i ∈ {1, 2, · · · , r}},
for given parameters k1, k2, · · · , kr.
Let B consist of those subsets of E that contain exactly ki elements from each Ei.
Then B is the set of bases of the corresponding partition matroid. For example, consider
the set E = {a, b, c, d}, a partition of E into two subsets E1 = {a, b}, E2 = {c, d},
and let k1 = k2 = 1. Then L = {∅, {a}, {b}, {c}, {d}, {a, c}, {a, d}, {b, c}, {b, d}}. One of
the bases of this partition matroid is {a, d}.
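For a ground set this small, the family L can be enumerated directly, which is a convenient way to cross-check the example. The sketch below (an illustrative check, not part of the thesis) reproduces the nine independent sets listed above.

```python
from itertools import combinations

E = ['a', 'b', 'c', 'd']
blocks = [{'a', 'b'}, {'c', 'd'}]   # partition E1, E2 with k1 = k2 = 1

def independent(S):
    """Independence oracle of the partition matroid: at most k_i = 1 per block."""
    return all(len(set(S) & blk) <= 1 for blk in blocks)

# Enumerate L, the family of independent sets.
L = [frozenset(S) for r in range(len(E) + 1)
     for S in combinations(E, r) if independent(S)]

print(len(L))  # 9: the empty set, four singletons, and four cross-block pairs

# Bases are the maximal independent sets: one element from each block.
bases = [S for S in L if len(S) == 2]
print(frozenset({'a', 'd'}) in bases)  # True
```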
Definition 10 (Partial Transversal [37]) Let J = {1, 2, · · · , r} be the indices of the
nonempty subsets (E1, E2, · · · , Er) of E. A transversal of (E1, E2, · · · , Er) is a subset
{e1, e2, · · · , er} of E such that ej ∈ Ej for all j ∈ J, and e1, e2, · · · , er are distinct.
Then, X ⊆ E is a partial transversal of (Ej : j ∈ J) if, for some subset K of J, X is a
transversal of (Ej : j ∈ K).
When the subsets (E1, E2, · · · , Er) form a partition of E, the set of partial
transversals of E coincides with the set of independent sets of the corresponding partition matroid.
2.2.1 Matroid Intersection
Definition 11 (Matroid Intersection) Let M1 = (X,L1), and M2 = (X,L2) be two
matroids over the same ground set X, with families of independent sets L1, and L2. The
set I = L1∩L2 is said to be the intersection of the matroidsM1 andM2. An element of
I is called a common independent set of M1 and M2. An element of I with maximum
cardinality is called a common base of M1 and M2.
Next, we present a slight variation of the weighted common independent set augmenting
algorithm given in [38] for finding a minimum-weight common base (i.e., a common
independent set of maximum cardinality and minimum weight) of two matroids
M1 = (X,L1) and M2 = (X,L2). The algorithm maintains a current common independent
set X̄ ⊆ X, which it repeatedly augments.
1. Initialization: X̄ := ∅.

2. Define a directed graph H with node set X and the following edge set. For
any xi ∈ X̄ and xj ∈ X \ X̄:

• If (X̄ \ {xi}) ∪ {xj} ∈ L1, then (xi, xj) is an edge in H;

• If (X̄ \ {xi}) ∪ {xj} ∈ L2, then (xj, xi) is an edge in H.

3. Define sets:

X1 = {xi ∈ X \ X̄ | X̄ ∪ {xi} ∈ L1}

X2 = {xi ∈ X \ X̄ | X̄ ∪ {xi} ∈ L2}

4. For any node xi ∈ X define its cost l(xi) by:

l(xi) = −c(xi) if xi ∈ X̄

l(xi) = c(xi) if xi ∉ X̄

The cost of a path m in H, denoted by c(m), is equal to the sum of the costs of
the nodes traversed by m.

5. We consider two cases:

Case 1: There exists a directed path m in H from a node in X1 to a node in X2.

• Choose the path m so that c(m) is minimal and m has a minimum number of
edges among all minimum-cost paths from a node in X1 to a node in X2.

• Let the path m traverse the nodes y0, z1, y1, z2, . . . , yg−1, zg, yg of H, in this
order, and set

X̄ := (X̄ \ {z1, . . . , zg}) ∪ {y0, . . . , yg}

• Go to Step 2.

Case 2: There is no directed path in H from a node in X1 to a node in X2.
Then, return X̄.

Note that the set X̄ returned by the algorithm is a maximum-cardinality common
independent set of least weight.
Below we present a simple example illustrating the execution of the matroid intersection
algorithm.

Example 12 Consider two matroids M1 and M2, where M1 is a vector matroid and M2
is a partition matroid over the same ground set X = {a1, a2, a3, a4, a5}. The weights
associated with the elements of the ground set are: c(a1) = 1, c(a2) = 2, c(a3) = 3, c(a4) = 1,
and c(a5) = 2. The following binary matrix A, corresponding to M1, represents element
ai as the ith column vector:

A =
[ 1 0 1 1 1 ]
[ 0 1 1 0 1 ]
[ 0 0 0 1 0 ].

The partition of the ground set for matroid M2 is given by E1 = {a1}, E2 = {a2, a3},
E3 = {a4, a5}, where k1 = k2 = k3 = 1. The algorithm starts with X̄ := ∅ and iteratively
augments it, such that the invariant X̄ ∈ I is maintained. The corresponding graph H for
each iteration is shown in Figure 2.2, whereas the corresponding sets X1, X2, and X̄ for
each iteration are given below.

Figure 2.2: Graph H for iterations 1, 2, and 3 during execution of the matroid intersection
algorithm for finding a least-cost common independent set of maximum cardinality of two
matroids M1 and M2.

First Iteration:
X1 = {a1, a2, a3, a4, a5}
X2 = {a1, a2, a3, a4, a5}
X̄ = {a1}

Second Iteration:
X1 = {a2, a3, a4, a5}
X2 = {a2, a3, a4, a5}
X̄ = {a1, a4}

Third Iteration:
X1 = {a2, a3, a5}
X2 = {a2, a3}
X̄ = {a1, a4, a2}

The cost of this maximum-cardinality common independent set is 4.
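For a ground set this small, the outcome can also be cross-checked by brute force: enumerate all subsets, keep those independent in both matroids, and take a maximum-cardinality one of minimum weight. The sketch below is a verification aid (not the augmenting algorithm itself); it encodes the columns of A as bitmasks over GF(2), an illustrative assumption about the underlying field.

```python
from itertools import combinations

# Columns of A (top row = most significant bit) and element weights from Example 12.
col = {'a1': 0b100, 'a2': 0b010, 'a3': 0b110, 'a4': 0b101, 'a5': 0b110}
cost = {'a1': 1, 'a2': 2, 'a3': 3, 'a4': 1, 'a5': 2}
blocks = [{'a1'}, {'a2', 'a3'}, {'a4', 'a5'}]   # partition, k_i = 1

def gf2_independent(vecs):
    """Linear independence over GF(2) of vectors given as bitmasks."""
    basis = {}
    for v in vecs:
        while v and (v.bit_length() - 1) in basis:
            v ^= basis[v.bit_length() - 1]
        if v == 0:
            return False
        basis[v.bit_length() - 1] = v
    return True

best = None
for r in range(len(col), 0, -1):                # try the largest cardinality first
    for sub in combinations(sorted(col), r):
        if any(len(set(sub) & blk) > 1 for blk in blocks):
            continue                            # not independent in M2
        if not gf2_independent([col[e] for e in sub]):
            continue                            # not independent in M1
        w = sum(cost[e] for e in sub)
        if best is None or w < best[1]:
            best = (sub, w)
    if best:
        break                                   # maximum cardinality reached

print(best)  # (('a1', 'a2', 'a4'), 4), matching the algorithm's result
```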
2.3 Hadoop MapReduce
Apache Hadoop [21], an open-source implementation of the MapReduce framework, is
widely used for processing large amounts of data in parallel on thousands of nodes. It
has been shown to scale out to thousands of nodes and is used across various domains
(from bio-sciences to commercial analytics). More than half of the Fortune 50 companies use
Hadoop [39]. Facebook’s Hadoop cluster holds more than 100 Petabytes of data which
grows at a rate of half a Petabyte every day [40].
Following is an overview of the MapReduce workflow for a job.
• The input data set (file) is split into smaller independent chunks to be processed
in parallel across many nodes.
• Each independent chunk (file split) is processed by a Mapper locally at a node.
Each Mapper generates intermediate 〈key, value〉 pairs.
• The role of a Partitioner is to partition intermediate 〈key, value〉 pairs based on
keys.
• During the Shuffle process, the intermediate data is delivered from the Mappers to
the corresponding Reducers via a network-intensive transfer.
• The Sorter merges and sorts the map outputs before they are presented to the Reducer.
For any particular key, however, the values are not sorted.

• Each Reducer is responsible for aggregating a unique set of keys and processes all
corresponding 〈key, value〉 pairs to generate the output.
• The final output is written back to the disk.
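The workflow above can be sketched in a Hadoop-streaming style, with the mapper and reducer as plain functions over 〈key, value〉 pairs. The snippet below is a minimal single-process simulation of a word count job; the toy input and function names are illustrative assumptions, not Hadoop's API.

```python
from itertools import groupby

def mapper(chunk):
    """Emit an intermediate <key, value> pair for every word in a file split."""
    for word in chunk.split():
        yield (word, 1)

def reducer(key, values):
    """Aggregate all values for a single key."""
    return (key, sum(values))

chunks = ["the cat chases the rat", "the dog chases the cat"]  # two file splits

# Map phase: each chunk is processed independently (in parallel on a cluster).
intermediate = [pair for chunk in chunks for pair in mapper(chunk)]

# Shuffle + sort phase: group the intermediate pairs by key.
intermediate.sort(key=lambda kv: kv[0])
output = [reducer(k, (v for _, v in group))
          for k, group in groupby(intermediate, key=lambda kv: kv[0])]

print(dict(output))  # {'cat': 2, 'chases': 2, 'dog': 1, 'rat': 1, 'the': 4}
```

On a real cluster the Partitioner decides which Reducer receives each key, and the grouped pairs travel over the network during the shuffle; here both steps collapse into the in-memory sort and groupby.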
Figure 2.3: A toy word count example demonstrating basic tasks of Hadoop MapReduce.
Figure 2.3 depicts the MapReduce process for a toy example focused on the word
count, where the goal of the job is to count the occurrences (values) for each word (key)
in an input file. In this example, the input file is divided into three chunks, each processed
in parallel by a separate mapper. The mappers' outputs (intermediate 〈key, value〉 pairs)
are further processed (aggregated) by two reducers into a smaller set of 〈key, value〉 pairs.
Precisely, the values for the keys cat, man, and the are aggregated at reducer 1,
whereas the values for the keys chases, dog, and rat are aggregated at reducer 2. The
route from a mapper to a reducer consists of two hops. Assuming each 〈key, value〉 pair
can be transmitted by one packet, there is a total of 30 link-level packet transmissions
for 15 〈key, value〉 pairs exchanged during the shuffle.
2.3.1 Combiner and In-network Combiner
MapReduce jobs are typically constrained by the limited bandwidth between different
nodes, so it is important to optimize the data transfer during shuffle [41]. A combiner
function, applied at the output of mappers, helps to reduce the size of data transfer
during shuffle. For instance, utilizing a combiner in the toy example of Figure 2.3 results
Figure 2.4: A toy word count example demonstrating the use of combiner.
in 24 link-level packet transmissions for 12 〈key, value〉 pairs exchanged during the shuffle
(as shown in Figure 2.4). This results in a 20% reduction in link-level network traffic
compared to the standard Hadoop mechanism.
Previous works [42–44] have proposed using the combiner at the network level to
further reduce network traffic; we refer to this as the in-network combiner. Later,
in Chapter 4, we compare our proposed solution with the in-network combiner.
Figure 2.5 describes the operation of an in-network combiner for the toy example of
Figure 2.3. Utilizing the in-network combiner results in a total of 21 link-level packet
transmissions: 15 link-level packet transmissions to the network switch and 6 link-level
packet transmissions from the network switch to the corresponding reducers (as shown
in Figure 2.5). This results in a 30% reduction in network traffic compared to the standard
Hadoop mechanism. Note that the in-network combiner also outperforms the combiner.
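The savings quoted above follow from simple accounting: each 〈key, value〉 pair costs one packet per hop, and the mapper-to-reducer route has two hops with the switch midway. The snippet below plugs in the pair counts from Figures 2.3–2.5 (15 intermediate pairs, reduced to 12 by the combiner, and to 6 at the switch by the in-network combiner).

```python
HOPS = 2                                # mapper -> switch -> reducer

vanilla    = 15 * HOPS                  # 15 pairs travel both hops
combiner   = 12 * HOPS                  # combiner shrinks mapper output to 12 pairs
in_network = 15 * 1 + 6 * 1             # 15 pairs to the switch, 6 pairs onward

print(vanilla, combiner, in_network)            # 30 24 21
print(round(1 - combiner / vanilla, 2))         # 0.2  -> 20% traffic reduction
print(round(1 - in_network / vanilla, 2))       # 0.3  -> 30% traffic reduction
```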
Figure 2.5: A toy word count example demonstrating the use of in-network combiner.
Chapter 3
Five Vital Planes for Green Big
Data Enterprise
Curtailing resource and energy consumption by data centers is being overlooked despite
rising energy costs. Part of the reason for this oversight is the tremendous growth of the
big data business, which is pushing energy efficiency down the priority list. A recent data
center industry survey by the Uptime Institute points out that 50% of the North American
respondents (the survey included individuals associated with 1,000 different data centers)
did not recognize energy efficiency to be a very important consideration [15].
The key to bringing down the huge electricity consumption by data centers is to fix
the most prominent factors triggering the power appetite. A report by Cisco points out
that almost 56% of the power consumed by a data center is used for the IT loads, followed
by cooling and distribution [46]. Contrary to the best practices guidelines for achieving
energy efficiency, the vast over-provisioning of IT resources has led to unnecessary energy
consumption as well as inefficient and low utilization of IT equipment. Indeed, it has been
widely accepted that the most prominent factors negatively impacting the energy usage
of IT loads are low utilization of servers, limited adoption of virtualization, inefficient
Part of this work is to appear in [45].
Figure 3.1: 6Ps of the big data enterprise.
network interconnects, and lack of appropriate business models [6].
Furthermore, challenges associated with integrating renewable energy solutions for
data centers are a major hurdle for maintaining greener energy portfolios.
In this vein we dissect the big data enterprise into five planes (5Ps) vital to turn
data centers into greener enterprises. These 5Ps are: server plane, virtualization plane,
interconnect plane, framework plane, and power procurement plane, as shown in
Figure 3.1. Appendix A surveys several key approaches to transform the 5Ps into greener
planes.
3.1 Server Plane
According to research done by Google, typical server cluster utilization averages
merely 10% to 50% [47]. Since the computing fabric of any big data processing platform
relies entirely on servers, an efficient server plane is essential for greener big
data processing. The increased workload at the servers bounds their efficiency by the
amount of free shared resources, both locally and globally. Greener optimization that
results in better management of resources is fundamental for an efficient server plane.
3.2 Virtualization Plane
Virtualization is a technique for sharing physical resources among different Virtual Ma-
chines (VMs). A VM is an emulation of a server with its very own operating system and
applications. A physical server hosts different VMs by deploying a virtualization layer
(hypervisor). Virtualization can lead to significant improvements in energy efficiency as
well as improved server utilization. Virtualization is accepted as the most popular power
management approach used by data centers [6, 48]. The computing resources like CPU
and memory required for each VM are assigned dynamically according to the require-
ments of the applications running on that VM. This dynamic resource allocation helps
to save a considerable amount of energy by maximizing the utilization of free computing
resources among VMs.
Furthermore, virtualization is an effective way to provide server consolidation. In
many scenarios an idle server claims more than 40% of the power used by a fully utilized
server. In such scenarios, server consolidation by using virtualized environments results
in energy savings by switching off the servers that are partially used or are nearly idle [49].
In essence, the virtualization plane is meant to be an aid to achieving energy efficiency in
the distributed computing environment. However, the success of the virtualization plane
depends on better server resource utilization and intelligent traffic engineering. Hence,
schemes that reshape network traffic as well as free the resources at the compute nodes,
helping them efficiently cope with multiple virtual instances, are a necessity for a greener
virtualization plane.
3.3 Interconnect Plane
The interconnect plane claims a significant proportion of the total energy consumption
by a data center. Moreover, the energy share of the interconnect plane is predicted to
soar further [26,50–52].
The high volume of communication, elemental to big data processing, is one of the major
causes of the staggering amount of energy consumed by the network components of data
centers. One aspect of bringing energy efficiency to data center networks
is the provision of load-proportional energy consumption in links and switches. In such a
network device, the energy consumption depends on the amount of data flowing through
it [26–30]. Therefore, a greener interconnect plane calls for intelligent communication
schemes that focus on reducing the amount of data flowing through the network devices.
3.4 Frameworks Plane
The core of major big data frameworks is to divide huge data sets into smaller subsets,
process them at separate nodes, communicate the processed information over the network
connecting different nodes (also referred to as the communication phase), and then com-
bine the results in a time-efficient manner. Therefore, the efficiency of the framework plane
and the infrastructure are tightly coupled. Hence, a solution applicable to the general
spirit of big data processing, i.e., processing in parallel across a large number of nodes, that
optimizes communication among servers is crucial for greener framework planes.
3.5 Power Procurement Plane
Efficient power procurement for data center entities comprises several components,
including the procurement process itself, a proper business model, and greener portfolio
management. The power procurement plane is a complex plane with socioeconomic
multi-party interplay; therefore, an important step for reshaping the power procurement
plane is to investigate the basic models that can capture distributed generation in the
presence of preferential green energy choices and the power transmission infrastructure
constraints.
3.6 Connecting the Dots
This chapter dissected the big data enterprise into five pivotal planes from the perspective
of energy efficiency. Chapter 4 presents a solution to optimize the communication in
distributed data center applications. We want to emphasize that although our focus
is reducing the volume of communication, the performance evaluation of the proposed
solution exhibits a performance improvement in all four planes internal to a data center,
i.e., server plane, virtualization plane, interconnect plane, and framework plane. Chapter
5 focuses on optimizing an important aspect related to the power procurement plane.
Chapter 4
A Coding Based Optimization for
Big Data Processing
The rise of the cloud and distributed data-intensive (“big data”) applications puts pres-
sure on data center networks due to the movement of massive volumes of data. Reducing
the volume of communication is pivotal for embracing greener data exchange by efficient
utilization of network resources. This chapter proposes the use of a mixing technique,
spate coding, as a means of dynamically-controlled reduction in the volume of communi-
cation. We introduce motivating real-world use-cases, and present a novel spate coding
algorithm for data center networks. We also analyze the computational complexity of
the general problem of minimizing the volume of communication in a distributed data
center application without degrading the rate of information exchange, and provide the-
oretical limits of such schemes. Moreover, we proceed to bridge the gap between theory
and practice by performing a proof-of-concept implementation of the proposed system
in a real world data center. We use Hadoop MapReduce, the most widely used big data
processing framework, as our target. The experimental results, employing two industry-standard
benchmarks, show the advantage of our proposed system compared to a vanilla
Part of this work appears in [53–55].
Hadoop implementation, the in-network combiner, and Combine-N-Code. The proposed
coding-based scheme shows performance improvements in terms of volume of communication,
goodput, disk utilization, and the number of bits that can be transmitted per Joule of
energy.
4.1 Facing the Challenge: High Volume of Communication and Big Data Processing
Cisco's global cloud index predicts that by 2017, cloud traffic (traffic generated by data
centers as well as other entities in the cloud) will represent 69% of data center traffic [56].
Furthermore, recent business insights into the evolution of data center networks also forecast
unprecedented growth in data center traffic, with 76% of the aggregate traffic never
exiting the data center [56]. The increasing migration of applications to the cloud is
probably the major driver of this trend: even without a fundamental change in the
average communication/computation ratio exhibited by data center workloads, increasing
the compute density per server by means of virtualization leads to a proportional increase
in traffic generated per server. Orthogonally to the effects of virtualization,
the onset of applications crunching and moving large volumes of data—captured by the
market-coined term big data applications—is also foreseen to significantly contribute to
higher network traffic within data centers [57].
The burden of the high volume of communication, elemental to big data processing, is
anchored to network components like switches and routers. With the advent of energy-proportional
computation at servers, the energy quota of network components is predicted
to soar as high as 50% of the total energy consumption of data centers [26]. Data
centers' network-related energy consumption was reported to be around 15.6 billion
kWh in 2008 [7].
Energy proportional network components play a vital role in reducing a data center’s
energy consumption. In energy proportional network components, the energy consump-
tion is proportional to the volume of communication. Reduction in volume of communi-
cation directly corresponds to energy savings in a data center [26–30]. Furthermore, some
approaches conserve data center networks' energy consumption by adaptively choosing
link rates to match the volume of traffic going through a network component. For
instance, an energy saving of approximately 4 Watts per link can be achieved by decreasing
the link rate from 1 Gbps to 100 Mbps [58]. In such adaptive approaches, decreasing the
volume of communication improves the energy profile by favoring lower link rates. Hence,
the opportunity to improve the energy efficiency by reducing volume of communication
serves as a cornerstone in bringing green practices to big data processing.
Even with the provision of energy-efficient network architectures, current data center
network architectures pose a serious challenge towards embracing big data in an efficient
and greener fashion. Specifically, data center networks are still not ready to digest
petabytes of network data traffic crossing the bisection [59, 60]. Processing data at this
speed needs a network with enough bisectional bandwidth where every server (node) can
send data at full speed to every other server (node). This means that network bisection
bandwidth is going to cap the rate at which different servers can communicate with each
other [61]. Unfortunately, the data center networks are optimized for North-South traffic
(connections between servers and clients) rather than East-West traffic (servers commu-
nicating with each other within data center) since most of the software architectures
in pre-cloud era were meant for clients accessing the servers within data centers. How-
ever, with the rise of cloud and distributed architectures like Hadoop MapReduce [21],
Storm [62], and Dryad/DryadLINQ [63,64] that are implemented on the different nodes
across the data centers, applications rely on East-West traffic for most of their
computations. In fact, as a de facto standard, data centers use a tree topology, assuming
that the network bisection (the top layer of the network, which is typically highly
oversubscribed, e.g., 240:1) offers enough bandwidth for the machines at the lower levels
in the network topology [60]. However, most of the distributed big data applications like Hadoop MapReduce
exchange tremendous amounts of data over highly oversubscribed links resulting in high
packet-drop rates [59]. Therefore, unfolding the full potential of big data by greener data
centers calls for efficient utilization of network resources, and new non-intrusive frame-
works that can seamlessly integrate with existing network architectures. Consequently,
such new frameworks are expected to heavily rely on managing network traffic.
The main driving force for the adoption of cloud data centers is rooted in their ability to
deliver services and data at a faster rate, resulting in improved application performance
as well as higher operational efficiencies. In order to achieve the desired performance,
in terms of bringing down the job-completion times of applications and enhancing the
operational characteristics of the applications, most of the research is focused on schedul-
ing computation, jobs, and resources (e.g., see [65–67] and references therein). However,
since more than 50% of the job completion-time of many jobs is consumed in the com-
munication phase [57], it is essential to explore novel means by which the communication
time can be reduced. In this context different approaches have been studied for ame-
liorating the network effects due to communication-intensive workloads by managing
flows [57,68,69], and demand-driven path alterations [70–72]. Bringing greener prospects
in the data center networks are as important as ensuring higher operational efficiency.
In this vein this work focus on presenting an efficient way to manage the network traf-
fic by minimizing its volume which results in improved resource usage and their energy
footprints. Our work is complementary to existing approaches. We argue that gauging
towards the right gains is of fundamental importance. For example, some approaches
related to optimizing big data frameworks focus on job completion time, whereas we
assert that optimizing resource utilization is more valuable when it comes to energy
efficiency. For example, many Hadoop jobs are batch jobs for which time efficiency is not
critical; overnight log analysis to capture trends, for instance, is not time-critical.
In such scenarios, if the perspective of optimization is shifted
from time towards resource utilization, the energy-gains can be improved.
4.2 Contributions
We propose a scheme for optimizing big data processing in a data center,
and present motivating use-cases.
Our major theoretical contributions involve analyzing the computational complexity
of a general scheme that tries to minimize volume of communication in a distributed data
center application without degrading the rate of information exchange. We then present
theoretical limits of such schemes by providing upper and lower bounds. Moreover, we
have proven that in contrast to a frequent practice in many data centers and high speed
networks, the network bisection is not an optimal location for middlebox placement
for some applications. Furthermore, we explore the potential of integrating the novel
concepts and techniques of spate coding into big data processing frameworks. We propose
to exploit a fundamental property common to most big data frameworks, i.e., sharing
of a physical node among several processes, giving rise to side information. This side
information arises because a process's output is available to other processes hosted by
the same node without going through the network. The concept
of spate coding is inspired by the index coding problem [73, 74], though these two are
different problems with different solution spaces. The core idea in spate coding is to
mix (encode) packets, while leveraging side information, with the objective to reduce the
overall volume of communication. In contrast to index coding, spate coding does not
necessarily consider a broadcast environment. We have presented an efficient solution for
spate coding.
Our major practical contributions include a proof-of-concept implementation of the
proposed system in a real-world data center and a testbed. Although our proposed
solution is general, for demonstration purposes we use Hadoop MapReduce
as our target framework. We focus on the shuffle phase of Hadoop, which is known to
be the communication-intensive phase of the application [75]. We evaluate the
proposed approach using two industry-standard benchmarks, TeraSort and Grep.
We compare the proposed approach not only with the vanilla Hadoop implementation,
but also with the in-network combiner (see Section 2.3.1 for background on the
in-network combiner). The performance evaluation parameters include the volume of
communication, goodput, disk utilization, queue size, and the number of bits that can
be transmitted per joule of energy. Moreover, we combine the coding-based scheme with
the in-network combiner into a scheme called Combine-N-Code, which further decreases
the network traffic.
4.3 Related Work
Several studies have been conducted to optimize data transfer during the communication-
intensive phase of big data frameworks. These studies can be classified into three broad
thrusts, namely traffic reduction, traffic engineering, and data center architectures.
Network traffic reduction is an important aspect of network optimization. Redundancy
elimination schemes identify and remove repeated content from network transfers
(e.g., [76–78]) to increase end-to-end application throughput. MapReduce jobs are
typically constrained by the limited bandwidth between different nodes, so it is
fundamentally important to optimize the data transfer during shuffle [41]. A combiner
function, applied at the output of the mappers, helps reduce the size of the data
transferred during shuffle. Recently, it has been proposed to gain further advantage
from the combiner by extending the combining operation into the network core
(see, e.g., Section 2.3.1). The in-network combiner has been applied both at the rack
level [44] and network-wide [42], [43].
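To make the combiner idea concrete, the following is a minimal Python sketch of how map-side pre-aggregation shrinks the number of 〈key, value〉 pairs that must cross the network during shuffle; the function names are ours, not Hadoop's, and the sketch applies only when the reduce operation is commutative and associative, as a count is.

```python
from collections import Counter

def map_phase(lines):
    # Emit a (word, 1) pair for every word, as a word-count mapper would.
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    # Pre-aggregate pairs locally before the shuffle, like a combiner.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

lines = ["a b a", "b b a"]
raw = map_phase(lines)          # 6 pairs would cross the network uncombined
combined = combine(raw)         # only 2 pairs cross the network combined
print(len(raw), len(combined))  # 6 2
```

The combined output carries the same information to the reducer with a third of the pairs, which is exactly the volume reduction the in-network variants extend into the network core.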
The traffic engineering thrust focuses on reshaping network traffic to improve
network performance through scheduling, routing, and prediction-based schemes. For
example, the scheme proposed in [57] focuses on network-state-aware scheduling by
dynamically manipulating flow rates. Moreover, [79, 80] focus on traffic-flow-prediction-based
scheduling to reduce the impact of network transfers and improve job processing.
Furthermore, the scheme in [81] improves bandwidth utilization by breaking bulky
(elephant) flows into mice flows, exploiting unused bandwidth to counter the inherent
problems of equal-cost multipathing.
To accommodate the growing needs of big data applications that call for tremendous
network-level transfers within a data center, many novel data center architectures have
been proposed. In particular, redesigning the network to expand the bandwidth among
servers has steered the design of full-bisection data center networks [60, 82, 83].
Furthermore, network coding has been used to reduce network traffic in two broad
categories of applications: traffic reduction for wireless networks, including sensor
networks, and traffic reduction for repairing failed nodes in distributed storage.
With respect to wireless networks, network coding has been shown to reduce the number
of transmissions by taking advantage of opportunistic listening [84–87]. For instance,
COPE is a wireless routing protocol that uses network coding together with
opportunistic listening to reduce the number of network transmissions [88]. A similar
application is code updates during the early development stage of wireless sensor
networks [89]. With regard to the application of network coding in distributed storage,
where network bandwidth is a critical resource, repairing a failed storage node using
network coding can reduce the network traffic needed for the repair [90–92].
Comparison to Related Work
In contrast to all the related work, which treats data flow as commodity flow
(routing), we utilize the concept of mixing (coding); it has been proven that mixing
at the network nodes can provide the optimal rate of information exchange in many
scenarios where routing cannot [93]. Furthermore, it has been shown that mixing
techniques such as index coding can significantly increase the rate of information
transfer in a network [74, 94, 95]. Still, index coding [73, 74, 94–97] assumes a
broadcast channel, an assumption that is far from reality in a data center. We
therefore propose spate coding to improve the rate of information transfer in a
non-broadcast environment, as is the case in a data center network. We point out that
neither the solution techniques nor the theoretical results for index coding carry
over to spate coding.
Furthermore, the usability of traffic reduction techniques based on an in-network
combiner is limited to jobs whose reduce task is commutative and associative, e.g.,
max, min, and sum. For reduce tasks that do not satisfy these properties, e.g., median,
the use of combiners yields incorrect results [42], [43]. In contrast, our proposed
scheme is not limited by the type of aggregation function and provides a greater
reduction in the volume of communication than the in-network combiner.
Moreover, in contrast to traffic reduction approaches based on redundancy elimination,
where intermediate nodes must participate in the decoding process, in our scheme only
the destination nodes perform the decoding. Furthermore, our scheme provides
instantaneous encoding and decoding of the packets. Additionally, our scheme can work
in conjunction with redundancy elimination schemes.
Moreover, the schemes belonging to the traffic engineering thrust can work in
conjunction with our proposed scheme. We point out that traffic-engineering-based
schemes rely on knowledge of demand profiles, whereas in many scenarios demand
profiles are either unknown or hard to predict. In addition, flow-based prediction
and scheduling suffers from an inherent problem with the bursty traffic patterns
common to big data frameworks, like Hadoop, where flows can be short-lived. In
contrast, our proposed scheme is oblivious to demand profiles, scheduling, and the
underlying routing protocols.
Additionally, our scheme reduces the network traffic without modifying the data
center architecture, and can prove to be a less costly alternative to proposals that
call for a fundamental design shift.
To the best of our knowledge, our work is the first to propose the use of coding in
the communication-intensive phase of big data applications to reduce network traffic,
and it differs from both the wireless scenarios and the repair of failed storage
nodes. More specifically, network coding for wireless scenarios is the index coding
problem, which relies on opportunistic listening; Section 4.5.2 compares spate coding
with index coding and discusses the inapplicability of a solution developed for a
broadcast medium to a non-broadcast medium. The storage node repair problem, on the
other hand, deals with the recovery of a piece of a file stored in coded format,
whereas we focus on the information exchange occurring during the communication phase
of a distributed application, which is not related to recovering lost data.
Furthermore, the problem of repairing a storage node while reducing network traffic
can be transformed into the classical single-source multicast network coding problem,
which is solvable in polynomial time; Section 4.5.2 contrasts it with the spate
coding problem, which is NP-hard.
4.4 Motivating Use-Cases
Before delving into system specification details, we present two motivating use-cases
using real, albeit toy, examples. The examples showcase the use of mixing (coding) in
big data applications, and its potential for reducing the volume of communication.
The first example focuses on Hadoop MapReduce, whereas the second focuses on Storm,
which can be thought of as Hadoop's counterpart for stream processing.
4.4.1 Hadoop MapReduce for Electricity Theft Detection
This example shows the advantage of the proposed mechanism for Grep (Global Regular
Expression) applications. Grep is a fundamental use case for benchmarking big data
frameworks; in fact, it is one of the core use cases taken up by Google to demonstrate
the utility of the MapReduce architecture [23].
Threshold detection, relative to the base load profile, is one of the methods to
detect non-technical losses and electricity theft [98]. This example highlights a
use-case for detecting atypically high and unaccounted-for electricity consumption
in an area by utilizing the Hadoop MapReduce framework. The Hadoop job scans the data
to count the number of times the power consumption was higher than a threshold. The
data used in this example is regenerated and anonymized (for privacy and
confidentiality reasons) from real-world smart meter data records accessed via the
Irish Social Science Data Archive [99], where readings were taken every 30 minutes
from 5000 smart meters over a period of two years.
Each smart meter reading consists of the following three components: the meter ID, a
day and time code specifying the day of the year as well as the time interval of the
day, and the power consumed.
As described in Section 2.3, a MapReduce task usually splits the input into indepen-
dent chunks which are first processed by the mappers placed across different data center
nodes in a completely parallel manner. The outputs of the mappers are then commu-
nicated to the reducers which are also placed across different nodes, during the shuffle
phase, for further processing to complete the job.
We note that in a typical real-world Hadoop cluster a node hosts multiple mappers
and reducers [41]; hence many mappers and reducers share common information.
Namely, a data center node (server) often hosts both a mapper and a reducer, whereby
the intermediate 〈key, value〉 pairs stored by the mapper on local storage are accessible
to the reducer running on the same server without the need to go through the
network. This gives rise to ample opportunity to code the packets exchanged during
the shuffle phase. Placement of mappers and reducers on a node is handled by the
Figure 4.1: Topology and mapper/reducer output/input in the electricity theft detection example.
tasktracker. A tasktracker is configured with a fixed number of slots for both mappers
and reducers; by default, it runs multiple mappers and reducers on a node
simultaneously [41]. Under default settings, a tasktracker has two slots for mappers
and two slots for reducers; however, the maximum number of mappers and reducers per
node can be controlled by setting mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum, respectively [41]. As a general practice, the
number of mappers and reducers is set equal to the number of cores in a node.
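For instance, a mapred-site.xml fragment for classic (MRv1) Hadoop raising both limits to four slots, a hypothetical value chosen to match a quad-core node, might look as follows:

```xml
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value> <!-- at most four concurrent mappers per node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value> <!-- at most four concurrent reducers per node -->
  </property>
</configuration>
```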
4.4.1.1 Hadoop Job
We consider a four-node Hadoop cluster in a standard data center network architecture
as shown in Figure 4.1. The objective is to count the total number of instances where
the smart meter with ID 1400 reported a power consumption exceeding a baseline of 0.5
for the four day and time codes considered in this example: 11518 (April 25th,
30-minute interval starting at 8:30 am), 35010 (December 16th, 30-minute interval
starting at 4:30 am), 00120 (January 1st, 30-minute interval starting at 9:30 am),
and 20513 (July 24th, 30-minute interval starting at 6:00 am).
The input log of smart meter records is divided by Hadoop into four splits (split1, ..., split4).
A split is stored on one of the four DataNodes (node1, ..., node4) subscribed to the
Hadoop Distributed File System (HDFS), where we assume that spliti is stored on nodei.
Each mapperi residing on nodei parses file spliti and emits the corresponding 〈key, value〉
pairs, where a key is the smart meter ID along with the associated day and time code,
and a value is the number of times the power consumption exceeds 0.5. In this job the
role of the map function is only to extract the parts of the smart meter data log
containing information about meter ID 1400 for one of the four targeted times and dates.
Before being stored, intermediate 〈key, value〉 pairs output by each mapper are parti-
tioned such that each reducer receives (fetches) pairs corresponding to only one key. The
reduce function processes the 〈key, value〉 pairs emitted by the mappers and counts the
total frequency with which the power consumption exceeds the threshold; for this, four
reducer instances are created, one at each of the four DataNodes (node1, ..., node4).
Due to the partitioning strategy, each reducer (reduceri on nodei) processes the
〈key, value〉 pairs emitted by the mappers and counts the total frequency corresponding
to a single key, as shown in Figure 4.1 (e.g., reducer2 residing on node2 is
responsible for counting the 〈key, value〉 pairs with all keys equal to "1400 35010").
During the shuffle phase the mappers' outputs (P1, ..., P12 as shown in Figure 4.1)
are delivered to the corresponding reducers; e.g., the mappers' outputs with the key
1400 11518, namely P4, P8, and P12, are delivered to reducer1. Without loss of
generality, we assume that a 〈key, value〉 pair is communicated by a mapper to a reducer
through a single packet transmission. Specifically, for our example,
Pi = (1400 20513, 1) for i = 3, 5, 7; Pj = (1400 00120, 1) for j = 2, 6, 10;
Pk = (1400 35010, 1) for k = 1, 9, 11; and Pn = (1400 11518, 1) for n = 4, 8, 12.
Packets P4, P8, and P12 are to be delivered to reducer1. Packets P1, P9, and P11 are
to be delivered to reducer2. Packets P2, P6, and P10 are to be delivered to reducer3.
Packets P3, P5, and P7 are to be delivered to reducer4.
4.4.1.2 Standard Mechanism
Using the standard Hadoop mechanism, it is easy to see that a total of 10 link-level
packet transmissions are required per reducer to fetch all of its desired 〈key, value〉
pairs from the respective mappers. For example, reducer1 residing on node1 is
responsible for counting the 〈key, value〉 pairs with all keys equal to "1400 11518";
this results in 10 link-level packet transmissions, calculated as follows:
• Two link-level packet transmissions to fetch packet P4 from mapper2 residing on
node2,
• Four link-level packet transmissions to fetch packet P8 from mapper3 residing on
node3,
• Four link-level packet transmissions to fetch packet P12 from mapper4 residing on
node4.
It then follows by symmetry that 40 link-level packet transmissions are required
to complete the shuffle phase. Note that a total of 16 link-level packet transmissions
cross the network bisection (AS − S1, AS − S2) during the shuffle phase.
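These counts can be reproduced mechanically. The following Python sketch hard-codes the tree topology of Figure 4.1, routes every remote 〈key, value〉 pair along its unique tree path, and tallies both the total link-level transmissions and the bisection crossings; the helper names are ours.

```python
from itertools import product

# Tree topology of Figure 4.1: two L2 edge switches under one aggregate switch.
links = [("node1", "S1"), ("node2", "S1"),
         ("node3", "S2"), ("node4", "S2"),
         ("S1", "AS"), ("S2", "AS")]
adj = {}
for a, b in links:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

def path(src, dst):
    # Breadth-first search; in a tree the path found is the unique one.
    frontier, seen = [[src]], {src}
    while frontier:
        p = frontier.pop(0)
        if p[-1] == dst:
            return list(zip(p, p[1:]))  # the links traversed
        for nxt in adj[p[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(p + [nxt])

bisection = {frozenset(("AS", "S1")), frozenset(("AS", "S2"))}
total = crossings = 0
nodes = ["node1", "node2", "node3", "node4"]
# Each reducer fetches exactly one packet from each of the other three mappers.
for src, dst in product(nodes, nodes):
    if src == dst:
        continue  # the local 〈key, value〉 pair never enters the network
    hops = path(src, dst)
    total += len(hops)
    crossings += sum(frozenset(h) in bisection for h in hops)
print(total, crossings)  # 40 16
```

Same-switch transfers cost two hops and cross-aggregate transfers cost four, which is exactly the 2 + 4 + 4 breakdown listed above.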
4.4.1.3 Proposed Coding Based Mechanism
We note here that a reducer residing on the same node as a mapper has access to that
mapper's output without going through the network; for instance, the output of mapper1 is
locally available to reducer1. We call this locally stored and accessible information the
side information available to a reducer; e.g., the side information of reducer1 consists
of P1, P2, and P3.
Leveraging this side information, we propose coding the packets, where coding refers
to applying a simple XOR (⊕) function to a set of input packets. In this example, the
coding is employed at the bisection, using a middlebox attached to the L2 aggregate
switch AS as shown in Figure 4.1. The coding results in only two packets,
P2 ⊕ P5 ⊕ P8 ⊕ P11 and P3 ⊕ P6 ⊕ P9 ⊕ P12, crossing the bisection. Each reducer can
use its side information to decode its required packets from the coded packet it
receives, as explained below. For example, when reducer1 receives P2 ⊕ P5 ⊕ P8 ⊕ P11,
it can decode its required packet P8 by XORing the received packet with the packets it
already has (P1, P2, and P3), i.e., P8 = (P2 ⊕ P5 ⊕ P8 ⊕ P11) ⊕ P1 ⊕ P2 ⊕ P3. This
follows from the fact that although P5 = (1400 20513, 1) and P3 = (1400 20513, 1) are
generated by different mappers, they contain the same information and hence
P5 ⊕ P3 = 0. Similarly, P1 ⊕ P11 = 0 since P1 = (1400 35010, 1) and P11 = (1400 35010, 1).
To explain the decoding process in an intuitive way, we introduce ≡ to denote equality
among packets: two different packets are equal if they convey the same information.
For example, although packets P4, P8, and P12 are different packets, they carry the
same information (1400 11518, 1), so we represent them by an equal packet A, i.e.,
P4 ≡ P8 ≡ P12 := A. Similarly, P1 ≡ P9 ≡ P11 := B, P2 ≡ P6 ≡ P10 := C, and
P3 ≡ P5 ≡ P7 := D. Let us explore how reducer1 decodes the packet it requires from
mapper3. We start with the coded packet P2 ⊕ P5 ⊕ P8 ⊕ P11 received by reducer1:

P2 ⊕ P5 ⊕ P8 ⊕ P11 ≡ C ⊕ D ⊕ A ⊕ B

Then, utilizing the side information available at reducer1:

(P2 ⊕ P5 ⊕ P8 ⊕ P11) ⊕ P1 ⊕ P2 ⊕ P3 ≡ C ⊕ D ⊕ A ⊕ B ⊕ B ⊕ C ⊕ D

= A ≡ P8 = (1400 11518, 1),

i.e., the packet required by reducer1 from mapper3.
Counting the packet exchange occurring via the access switches together with the
transmission of the packets input to the coding function to the point of coding (the
aggregation switch in our example), we find that in this case a total of 36 link-level
packet transmissions are required to complete the shuffle phase. More specifically:

• Each of the four reducers fetches one record from the DataNode one switch away,
connected via the L2 switch S1 (S2), incurring 2 link-level packet transmissions
per reducer and a total of 8 link-level packet transmissions for all the reducers;

• 16 link-level packet transmissions, four from each of the four DataNodes to the
L2 aggregate switch (AS);

• A total of 12 link-level packet transmissions to deliver the two coded packets from
the L2 aggregate switch (AS) to all the DataNodes.
Note that with coding a total of 12 link-level packet transmissions cross the
network bisection during the shuffle phase, i.e., a 25% reduction in network bisection
traffic compared to the baseline Hadoop implementation. The use of identical values for
each distinct key generated by a mapper in this example, picked deliberately to ease
presentation, favours the efficiency of coding, but obviously may not hold for
production Hadoop computations. We generalize the concept of the coding-based shuffle
beyond this simplifying assumption in Section 4.8.1.
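The XOR algebra of the decoding step can be checked mechanically. A minimal Python sketch, using byte strings for the four distinct payloads A, B, C, and D (the exact packet format is immaterial to the XOR identities):

```python
def xor(*packets):
    # Bitwise XOR of equal-length packets, byte by byte.
    out = bytes(len(packets[0]))
    for p in packets:
        out = bytes(a ^ b for a, b in zip(out, p))
    return out

A = b"1400 11518,1"   # payload of P4, P8, P12
B = b"1400 35010,1"   # payload of P1, P9, P11
C = b"1400 00120,1"   # payload of P2, P6, P10
D = b"1400 20513,1"   # payload of P3, P5, P7

# The coded packet crossing the bisection: P2 xor P5 xor P8 xor P11.
coded = xor(C, D, A, B)
# reducer1 XORs in its side information P1, P2, P3 (payloads B, C, D) ...
decoded = xor(coded, B, C, D)
# ... and recovers P8's payload, the pair it needed from mapper3.
print(decoded == A)  # True
```

Because every payload cancels itself (X ⊕ X = 0), the three side-information payloads strip B, C, and D out of the coded packet, leaving exactly A.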
4.4.2 Storm for DNA Sequencing
This example introduces the concept of partial coding, and shows that coding
generalizes to various 〈key, value〉 pair patterns and is oblivious to their semantics.
Here we focus on Storm (an event processor) [62], a distributed real-time computation
framework for processing unbounded data streams. Apache Storm is an emerging big
data platform used by Taobao, Ooyala, Infochimps, the Weather Channel, and Groupon.
The core abstraction in Storm is the stream, a continuous sequence of tuples.
The basic components of Storm for performing stream transformations are spouts
and bolts. A spout can be considered a source of streams, whereas a bolt processes
a number of input streams and emits the transformed streams. Spouts and bolts are
connected through a topology, a connectivity graph, which is an abstraction of how the
stream transformation takes place. In fact, Storm clusters and Hadoop clusters are
very much alike, except that Hadoop works on a MapReduce job while Storm works on
topologies of spouts and bolts. Spout and bolt tasks are spread across the Storm
cluster. A Storm topology represents an arbitrarily complex multi-stage stream
computation [100]. Storm is used along with Cassandra [101] as an external database
in order to preserve the bolts' state in the event of a failure.
To demonstrate the versatility of our proposed solution, we consider a network
architecture different from the one considered in Example 4.4.1. More specifically, we
consider a network architecture that features split multi-link trunking (IEEE
802.3ad) [102], NIC (network interface controller) teaming with adaptive load
balancing (ALB) [103], and load balancing on servers where different virtual machines
are assigned to different NICs.

Figure 4.2: Twelve records emitted by spout for the DNA-sequence processing job.
4.4.2.1 Storm Job
The Storm job considered is a DNA-sequencing application processing samples of short
DNA-sequence records. DNA sequencing is fundamental to cancer treatment and is one of
the major big data applications in the health sector [104]. Each DNA-sequence record
consists of two parts: a short DNA sequence (the value) and a unique ID associated
with this sequence (the key). The objective of this job is to cluster short DNA
sequences based on the second digit of their unique ID. For demonstration purposes, we
assume that the input data for this DNA-sequence processing job contains only the
twelve records given in Figure 4.2.
Figure 4.3 shows the Storm topology for the DNA-sequencing job. The topology
consists of one spout and two bolts, bolt A and bolt B, where each bolt has four tasks.
The output from the spout is fed to the four tasks of bolt A. Bolt A emits modified
streams through NIC eth0. These modified streams are fed, via NIC eth1, to bolt B,
Figure 4.3: Storm topology for DNA sequencing example.
Figure 4.4: Placement of tasks of bolt A and bolt B in the DNA sequencing example on a four-node Storm cluster.
which is also running four tasks. The tasks of bolt A and bolt B run on the four
DataNodes (node1, ..., node4), subscribed to Cassandra, as shown in Figure 4.4; taskki
represents the task of bolt k running on nodei. Furthermore, the stream grouping for
bolt A is shuffle grouping [105], which distributes the tuples to its tasks in a random
fashion. The stream grouping for bolt B is fields grouping based on the second digit of
the key, which ensures clustering of the short DNA sequences based on the second digit
of their unique ID. Specifically, the tuples from output streams of bolt A whose key
has second digit i will be delivered to taskBi+1 (e.g., the output of bolt A where the
second digit of the key is 0 will be delivered to taskB1). Without loss of generality,
we assume that a 〈key, value〉 pair is transmitted from bolt A to bolt B through a single
packet transmission, and that the spout's output is such that P1, ..., P3 are received
by taskA1, P4, ..., P6 by taskA2, P7, ..., P9 by taskA3, and P10, ..., P12 by taskA4.
4.4.2.2 Standard Mechanism
We proceed to analyze this scenario from the perspective of the standard Storm
mechanism. Assuming each link has the same capacity, it is easy to see that the links
on the path from switch S1 to S2, i.e., S1 − AS − S2, are the bottleneck links.
Calculating the total link utilization incurred by a taskAi to communicate the
transformed streams to the corresponding tasks of bolt B, we find that a total of 12
link-level packet transmissions are required using the standard Storm mechanism. It
then follows by symmetry that 48 link-level packet transmissions are required to
complete the transfer from all tasks of bolt A to the corresponding tasks of bolt B.
Note that a total of 24 link-level packet transmissions cross the network bisection.
4.4.2.3 Proposed Coding Based Mechanism
Observing the output tuples from the tasks of bolt A, it is clear that, in contrast
to Example 1, coding on the basis of an entire 〈key, value〉 pair offers no advantage
in this example. This follows from the fact that each of P1, ..., P12 has a unique
key; hence no task of bolt B has sufficient side information to decode a subset of
entire packets coded together. Therefore, we exploit the concept of partial (a.k.a.
fractional) coding. At S1, the L2 switch of the toy network topology in Figure 4.4,
the coding is performed only on portions of the packets (the second digit of the key
and the complete value). More specifically, the coded packets are: Ω1 ⊕ Ω4 ⊕ Ω7 ⊕ Ω10,
Ω2 ⊕ Ω5 ⊕ Ω8 ⊕ Ω11, and Ω3 ⊕ Ω6 ⊕ Ω9 ⊕ Ω12, where Ωi represents the part of Pi
comprising the second digit of the key and the complete value; for example, Ω1 is
highlighted in a block in Figure 4.2. Each of these coded data units is then combined,
by concatenation, with its corresponding raw (i.e., not coded) portions to form a
partially coded packet. Further details on the format of a partially coded packet are
given in Section 4.8.1.
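The mechanics of partial coding can be sketched in a few lines of Python. The record layout below is hypothetical (a one-byte first key digit followed by a one-byte second key digit and a four-byte DNA fragment), and the helper names are ours; the sketch shows only how the raw portions are concatenated with the XOR of the codable portions, and how a receiver holding all but one codable portion as side information recovers the missing one.

```python
def xor_bytes(parts):
    # Bitwise XOR of equal-length byte strings.
    out = bytes(len(parts[0]))
    for p in parts:
        out = bytes(a ^ b for a, b in zip(out, p))
    return out

def partial_encode(packets, split):
    # Keep the first `split` bytes of every packet raw (here: the first key
    # digit) and XOR only the remainders (second key digit plus value).
    raw = b"".join(p[:split] for p in packets)
    coded = xor_bytes([p[split:] for p in packets])
    return raw + coded

def partial_decode(partially_coded, split, n, side_info):
    # A receiver holding the codable portions of all but one packet XORs
    # them out of the coded tail to recover the missing portion.
    coded_tail = partially_coded[split * n:]
    return xor_bytes([coded_tail] + side_info)

# Hypothetical fixed-width records standing in for P1, P4, P7, P10.
p1, p4, p7, p10 = b"10ACGT", b"41TGCA", b"72GATC", b"03CTAG"
packet = partial_encode([p1, p4, p7, p10], split=1)

# With the codable parts of p1, p4, p7 as side information, the codable
# part of p10 is recovered from the coded tail.
missing = partial_decode(packet, 1, 4, [p1[1:], p4[1:], p7[1:]])
print(missing == p10[1:])  # True
```

The partially coded packet carries four raw bytes plus one coded tail instead of four full tails, which is the source of the bisection savings computed below.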
Following arguments similar to those used in Section 4.4.1, it can be shown that each
task of bolt B can decode its required 〈key, value〉 pairs. It is interesting to note
that in this scenario, where coding is performed only on a part of each packet,
excluding the first digit of the keys, the coding-based scheme results in
significantly, namely 75%, lower utilization of the network bisection links, while
maintaining the same amount of information transfer as without coding. Depending on
the use-case, this translates to 75% more Storm jobs running simultaneously, or to a
75% decrease in job completion time if the links AS − S1 and AS − S2 are congested.
4.5 Spate Coding Problem
4.5.1 Problem Formulation
This section formalizes the concept of coding by defining the spate coding problem.
While spate coding has flavours similar to network coding and index coding,
Section 4.5.2 highlights the differences.
Definition 13 (Spate coding problem) An instance of the spate coding problem is
defined by a coding server s, a set of clients C = {c1, . . . , cm}, a set of
equal-length packets P = {p1, . . . , pn}, and a network topology specified by a tree
H(V, E) rooted at s. A certain subset of packets, known as its Want set Wi ⊆ P, is to
be delivered to each ci via s. Each ci has access to some side information known as
its Has set Hi ⊆ P. Let (X1, X, X2) be a partition of V; then X1 = {s}, the vertices
in X = {v1, . . . , vk} host clients, and the vertices in X2 are dumb vertices, which
can only forward or route packets. The coding server can transmit and route the
packets in P, or their combinations (packets coded over a finite field), over the
edges (s, vi) ∈ E, where vi ∈ X ∪ X2. The goal is to find a transmission scheme that
requires the minimum number of link-level packet transmissions to satisfy the requests
of all clients¹.
In short, if ρe denotes the set of link-level packets traversing e ∈ E, and
U = ⊎_{e∈E} ρe², then the objective is to find a mapping ζ : P → U, subject to packet
routing constraints, that minimizes |U|, such that for each ci there exists a decoding
function ψi with ψi(ui, Hi) = Wi, where ui represents the packets received by ci;
i.e., each ci should be able to recover its Want set from the packets received and the
side information available to it.
We remark that two different packets might carry the same payload, and the coding
server exploits this to reduce the number of link-level transmissions; we elaborate on
this below in the context of the example of Section 4.4.1. Moreover, for
¹A packet traversing an edge e ∈ E corresponds to one link-level packet transmission.
²⊎ denotes disjoint union, where ⊎_i ρ_i := ⋃_i {(ϱ, i) : ϱ ∈ ρ_i}.
Figure 4.5: An instance of spate coding for Example 4.4.1.
practical implementation scenarios a client might need some additional information,
e.g., a coding vector in the form of a header, to decode the received coded packet.
We discuss this in Section 4.8.1.2.
The example in Section 4.4.1 can be represented as an instance of the spate coding
problem, where P = {P1, P2, ..., P12}, X1 = {AS}, X2 = {S1, S2}, and
X = {node1, node2, node3, node4}. The coding server is co-located with the L2
aggregate switch AS. There is a set of four clients
C = {reducer1, reducer2, reducer3, reducer4}, where ci = reduceri is hosted on nodei.
The Want set of c1 is W1 = {P8, P12}. The side information available to c1 is given by
its Has set H1 = {P1, P2, P3}, and similarly for the rest of the clients, as shown in
Figure 4.5. Note that although reducer1 needs P4, it is not included in W1 because
reducer1 does not fetch it via AS; rather, it is fetched via S1 from mapper2.
Furthermore, u1 = u2 = {P8 ⊕ P11, P9 ⊕ P12}, and u3 = u4 = {P2 ⊕ P5, P3 ⊕ P6}. The
total number of link-level packets transmitted, |U|, is 12, where each of the four
packets transmitted by the coding server results in three link-level transmissions.
For instance, packet P8 ⊕ P11 traverses the three edges (AS, S1), (S1, node1), and
(S1, node2). Furthermore, ρ(AS,S1) = {P8 ⊕ P11, P9 ⊕ P12}. Note that P4, P8, and P12,
though different packets, carry the same payload; the same is true for the following
sets of packets: {P1, P9, P11}, {P2, P6, P10}, and {P3, P5, P7}. Furthermore, to
extract packet P8 from P8 ⊕ P11 by computing (P8 ⊕ P11) ⊕ P1, reducer1 needs
information about the equivalence of packets P11 and P1. The mapping ζ can be computed
using Algorithm 1, whereas the decoding function ψi is specified by Algorithm 2. We
present the proposed solution to the spate coding problem in Section 4.5.3.
4.5.2 Spate Coding and its Relationship with Network Coding
and Index Coding
In this section, we describe the spate coding problem in the context of the
traditional network coding and index coding problems. We begin by noting that,
although all these coding problems exploit mixing of packets, each targets a different
environment, which affects the problem characterization and the solution space. The
index coding problem was first introduced in 1998 [94] and network coding in
2000 [106], yet it was not until 2015 [107, 108] that a first comprehensive
relationship was established between the two, though some previous work focused on
establishing the relationship for specific scenarios (e.g., [109]). We emphasize that
the spate coding problem is different from both the standard network coding and index
coding problems [93–97, 110–112].
We first highlight the relationship between the index coding problem and the spate
coding problem. Specifically, every instance of the index coding problem can be translated
to a corresponding instance of the spate coding problem, as given by Theorem 14.
Theorem 14 For each instance of the index coding problem there exists a corresponding
instance of the spate coding problem.
The proof of Theorem 14 is given in Appendix B.1.
Next, we briefly describe the relationship between the network coding problem and the
spate coding problem. Theorem 15 shows that every instance of the network coding
problem can be translated to a corresponding instance of the spate coding problem.
Theorem 15 For each instance of the network coding problem there exists a corresponding
instance of the spate coding problem.
The proof of Theorem 15 is given in Appendix B.2.
We point out that it has already been shown that the network coding and index
coding problems are equivalent [108]; however, it is still unknown whether network
coding/index coding can subsume spate coding as a special case.
4.5.2.1 Discussion
The index coding problem is closely related to the spate coding problem by virtue of
incorporating side information. However, the index coding problem is defined for networks
communicating over a broadcast channel, whereas the spate coding problem addresses
networks communicating over a non-broadcast channel. Spate coding thus extends and
generalizes the index coding problem to non-broadcast environments. To highlight the
difference that a non-broadcast environment can make, we present a simple example below
showing that, for similar instances, the solution to the index coding problem can be
counterproductive in the non-broadcast environments for which spate coding has been
developed. This example also highlights the inability of existing index coding solutions to
capture spate coding scenarios.
Example 16 Consider an instance of the spate coding problem as shown in Figure 4.6(a).
The similar instance of the index coding problem is shown in Figure 4.6(b). The similarity
of the instances is in terms of the number of clients and the same Want and Has sets
associated with each client. It can be verified that, with traditional routing, 16 link-level
packet transmissions are required to complete the exchange task in Figure 4.6(a). In
contrast, even if we use the optimal solution for the similar instance of the index coding
problem shown in Figure 4.6(b) to complete the exchange task in Figure 4.6(a), 26 link-level
packet transmissions are required in total (eight link-level packet transmissions for the
uplink plus eighteen for the downlink). Therefore, the transition from a broadcast medium
to a non-broadcast environment needs a new problem framework and modelling.
Moreover, current solutions to the network coding problem do not address the spate coding
problem. Most of the work in the network coding domain is focused on multicast scenarios,
where network coding is a polynomial-time problem; accordingly, the proposed solutions are
of polynomial-time complexity. Keeping in mind that spate coding is an NP-hard problem,
such solutions and algorithms cannot solve the spate coding problem unless P = NP.
Furthermore, an invariant necessary for the correctness of network coding solutions is the
ability to perform coding at nodes of degree two or more in the information flow graph,
whereas in spate coding only one node can perform coding, violating this invariant.
Moreover, we propose to exploit two fundamental properties common to most distributed
big data frameworks. First, the sharing of a physical node among several processes gives
rise to side information, i.e., the availability of a process's output to other processes hosted
by the same node without going through the network. Second, data exchange among
processes distributed across different physical nodes to complete the desired task gives rise
to the opportunity of mixing different flows to optimize east-west traffic. None of the
existing solutions to the network coding problem considers the side information available
at the receivers while performing the coding operation at network nodes, and hence they
cannot be used to solve the problem setup presented in this work.
Furthermore, although we initially formulated the spate coding problem while targeting
distributed application scenarios, the proposed scheme is applicable to any setup where
the nodes contributing to a data exchange share some common information.
Figure 4.6: (a) An instance of the spate coding problem. (b) A similar instance of the index coding problem.
4.5.3 Solution for Spate Coding Problem
This section presents a solution for the spate coding problem. For efficiency, the only
coding operation the server is permitted to perform is ⊕, i.e., operations over GF(2).
We start by defining the dependency graph to capture information shared between
the clients (nodes).
Definition 17 (Dependency Graph) Without loss of generality, and for ease of presentation,
we assume each client ci requires just one packet. In case a client χ requires
|W| > 1 packets, we replace it by |W| clients, ci,1, · · · , ci,|W|, each having one of the |W|
packets as its Want set and each having the same Has set as that of χ.
Given an instance of the spate coding problem, we define a dependency graph G(V,E)
with vertex set V and edge set E as follows:
• For each client ci there is a vertex vi in G(V,E);
• For any two clients ci and cj not residing on the same node (server), there is
a directed edge in G(V,E) from vi to vj if and only if for Pi ∈ Wi, ∃ P ∈ Hj : P ≡ Pi,
where ≡ represents equality as described in Section 4.4.1; in other words, an edge is
directed from vi to vj whenever cj has a packet equal to the one wanted by ci.
Algorithm 1 ComputeCode
Input: An instance of spate coding problem
1: X′ := φ
2: while X ≠ φ AND X ≠ X′ do
3:   For each set of clients X = {c1, c2, · · · , cy} such that W1 ≡ W2 ≡ · · · ≡ Wy, X′ := X \ X \ c1
4:   From the given instance of spate coding problem for the clients in X′, construct the dependency graph G(V,E);
5:   for k = λ down to 2 do
6:     while ∃ a clique ϑ in G(V,E) of size k do
7:       Divide the clique into x smaller cliques θi, each belonging to a different subtree in the multicast topology.
8:       for i = 1 to x do
9:         Multicast a packet that satisfies all clients corresponding to the vertices in θi;
10:        V := V \ θi;
11:        Delete the clients corresponding to the vertices in θi from the client set X
12:      end for
13:    end while
14:  end for
15: end while
16: if X ≠ φ then
17:   Send the uncoded packets corresponding to the clients in X
18: end if
Figure 4.7 shows the dependency graph for the Example in Section 4.4.1, where each
client ci is split into two vertices vi,a and vi,b as |Wi| = 2.
Lemma 18 A clique of size n in the dependency graph represents a group of n clients,
whose requests can be satisfied with only one coded packet.
Proof: Follows from the fact that all the clients represented by a clique in the
dependency graph can be satisfied by one transmission, which consists of an ⊕ of all the
packets in their Want sets.
Figure 4.7: Dependency graph for the transmissions crossing AS for the Example given in Section 4.4.1. The execution of Algorithm 1 results in four multicast packets, shown by four cliques each of size 2.
The proposed Algorithm 1, named ComputeCode, greedily packs as many cliques
as possible, starting from the largest cliques of size λ. Each clique corresponds to only
one multicast packet; therefore, packing larger cliques first heuristically
ensures more savings in terms of the number of transmissions. However, finding all cliques
can become computationally prohibitive; we therefore focus on finding all the cliques
of size λ ≤ 4 to allow close-to-realtime execution of the coding algorithm. Increasing
λ could possibly increase the coding benefit. However, in addition to this time-versus-coding-benefit
tradeoff, another important factor to consider while selecting λ is the maximum
number of leaf nodes in the subtree rooted at the middlebox. λ should not exceed the
maximum number of leaf nodes in the subtree, since the maximum coding advantage is
achieved when all the packets heading to the leaf nodes are coded together
as one packet.
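A minimal sketch of the dependency-graph and clique-packing steps is given below. It is a simplified rendering of Algorithm 1 that omits Want-set de-duplication and the multicast-subtree split; the client data and helper names are our own, chosen for illustration:

```python
from itertools import combinations

# Illustrative client set: each client wants exactly one packet
# (per Definition 17) and holds a Has set; node names are invented.
clients = {
    "c1": {"node": "n1", "want": "P2", "has": {"P1", "P5"}},
    "c2": {"node": "n1", "want": "P5", "has": {"P2", "P6"}},
    "c3": {"node": "n2", "want": "P1", "has": {"P2", "P5"}},
}

def has_edge(ci, cj):
    """Directed edge vi -> vj: cj (on a different node) holds ci's wanted packet."""
    return (clients[ci]["node"] != clients[cj]["node"]
            and clients[ci]["want"] in clients[cj]["has"])

def is_clique(group):
    """All pairs mutually connected: one XOR packet satisfies the whole group."""
    return all(has_edge(a, b) and has_edge(b, a) for a, b in combinations(group, 2))

def pack_cliques(names, lam=4):
    """Greedily pack cliques from size lam down to 2; leftovers go uncoded."""
    remaining, transmissions = set(names), []
    for k in range(lam, 1, -1):
        found = True
        while found:
            found = False
            for group in combinations(sorted(remaining), k):
                if is_clique(group):
                    transmissions.append(set(group))  # one multicast coded packet
                    remaining -= set(group)
                    found = True
                    break
    transmissions += [{c} for c in sorted(remaining)]  # uncoded packets
    return transmissions

print(pack_cliques(clients))
```

Here c1 and c2 share node n1, so no edge connects them; c1 and c3 mutually hold each other's wanted packets and form a size-2 clique served by one coded packet, while c2's packet is sent uncoded.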
The computational complexity of our clique packing algorithm is O(|V|^4). In general,
extending the same algorithm to find all cliques of size λ and smaller results in O(|V|^λ)
computational complexity. We note that the computational complexity of clique finding
can be significantly improved to O(κ^(λ−2)|E|) using algorithms similar to [113], where κ
is the arboricity of the dependency graph. For graphs with small κ, e.g., planar graphs
with κ = 3, clique finding is a very efficient operation, O(|E|).
To reduce the overhead associated with each multicast packet, we exploit the relationship
between a multicast packet, its corresponding multicast tree, and its corresponding clique.
We note that each sub-clique, a subset of the vertices and corresponding edges of a clique,
encodes fewer packets than the original clique. Moreover, a sub-clique corresponds to a
sub-tree of the multicast tree associated with the original clique. Hence, we divide a clique
into sub-cliques such that each multicast sub-tree carries a packet that encodes
a subset of the packets in the original encoded packet, resulting in less coding overhead.
For instance, for the packet format proposed for Hadoop in Section 4.8.1.2, the header of each
multicast packet conveys the information regarding each of the packets in CODED SEG
using CODING VECTOR, RIDK, and UNCODED SEG; if CODED SEG contains fewer
packets, the sizes of CODING VECTOR, RIDK, and UNCODED SEG are smaller. Also note
that since each sub-clique is also a clique, the validity of the solution holds, i.e., each
client can decode instantaneously without any need to buffer the packets it receives.
Further, dividing a clique into smaller cliques does not increase the overall number of
link-level packet transmissions, as it is equivalent to separating information intended for
different clients.
The execution of the proposed algorithm results in two cliques, each of size 4 (shown
with dotted lines in Figure 4.7), for the Example in Section 4.4.1. These cliques are further
subdivided into two sub-cliques each; the sub-cliques associated with the multicast sub-tree
rooted at switch S1 are {c1,1, c2,1} and {c1,2, c2,2}, and the sub-cliques associated with
the multicast sub-tree rooted at switch S2 are {c3,1, c4,1} and {c3,2, c4,2}, as shown in
Figure 4.7. This corresponds to multicasting packets p2 ⊕ p5 and p3 ⊕ p6 to reducer3 and
reducer4, and multicasting packets p8 ⊕ p11 and p9 ⊕ p12 to reducer1 and reducer2.³
4.6 Challenges and Considerations
Some of the obvious questions with regard to our proposal of using mixing (coding)
are:
1. How does the coder (middlebox) collect the side information, and determine the
destination of each coded packet?
2. How is the spate coding problem incorporated in the proposed architecture?
3. What is the coding format, and how can one determine which parts of 〈key, value〉
pairs should be encoded?
4. How does one encode the packets while being able to perform practical line-rate
processing?
5. What extra information is required in each packet, in addition to a packet's payload,
so that each client can decode the required packets instantaneously?
6. How does a client decode a packet?
7. Can we say something about computational complexity of such schemes in general,
and provide some bounds on the advantage?
8. Can the proposed architecture integrate seamlessly in the current big data archi-
tectures (like Hadoop)?
9. How does the proposed scheme perform in practice?
³ Note that multicasting four packets p2 ⊕ p5, p3 ⊕ p6, p8 ⊕ p11, and p9 ⊕ p12 instead of two packets p2 ⊕ p5 ⊕ p8 ⊕ p11 and p3 ⊕ p6 ⊕ p9 ⊕ p12 (as explained in the example given in Section 4.4.1) does not increase the overall number of link-level packet transmissions crossing the network. For instance, multicasting packet p2 ⊕ p5 ⊕ p8 ⊕ p11 to all four reducers uses both links AS−S1 and AS−S2, whereas multicasting packet p2 ⊕ p5 (p8 ⊕ p11) uses only one link, AS−S1 (AS−S2).
We present a complete architecture that addresses questions 1 to 6 in Section 4.8.1.
We present analysis related to question 7 in Section 4.7. The answer to question 8 is
discussed in Section 4.8.3, and regarding question 9 we present evaluation results from a
clustered prototype implementation of the proposed scheme in Section 4.9.
4.7 Analysis
In this section, we analyze the computational complexity and the advantage of schemes
that aim to minimize data transmission in distributed data center applications.
A distributed data center application runs over a set of DataNodes hosting the sender
service instances that send data to other nodes, and the receiver service instances that
receive data from other nodes. A DataNode can host both sender service instances
and receiver service instances. The data is transferred over a communication graph
I(V,E) from the sender service instances to the receiver service instances in the
communication phase. The vertex set V consists of the DataNodes as well as the routing and
forwarding nodes in the communication graph. The edge set E consists of all the
communication links connecting the vertices in V. For instance, in Hadoop MapReduce, a
mapper is a sender service instance, a reducer is a receiver service instance, and the shuffle
is the communication phase.
We first define the Transmission Intensity of a distributed data center application.
Definition 19 (Transmission Intensity) Transmission intensity is defined as the total
number of link-level packet transmissions during the communication phase of a distributed
data center application. The transmission intensity of a communication scheme
Si is denoted by |P(Si)|.
A scheme S is called Transmission Intensity optimal if for all other schemes Si:
|P (S)| ≤ |P (Si)|.
We next define the problem efficient scheme (ES).
Definition 20 (Problem ES) Find a scheme S that is Transmission Intensity optimal.
Let OPT (S) denote the optimal solution to the problem ES.
Theorem 21 The Problem ES is NP-hard, and it is NP-hard to find a polynomial-time
constant-factor approximation as well.
The proof of Theorem 21 is given in Appendix B.3.
We now analyze the maximum and minimum advantage that the proposed coding-based
scheme can offer compared to the current standard non-coding (routing-based)
techniques. We start by defining the Utilization Ratio.
Definition 22 (Utilization Ratio (µ)) The Utilization Ratio µ is the ratio of the number of
link-level packet transmissions when employing a coding based solution to the number of
link-level packet transmissions when not employing a coding based solution.
We also define the diameter of a network, represented by a communication graph, as
follows.
Definition 23 (Diameter of a Network (d)) The diameter of a network d is the longest
of all the shortest paths (in terms of number of hops) between the vertices representing
distinct DataNodes in the communication graph.
Theorem 24 provides a lower bound on the utilization ratio for a general network in
terms of its diameter d.
Theorem 24 µ ≥ 1/d.
The proof of Theorem 24 is given in Appendix B.4.
Corollary 25 For distributed data center applications running on a large number of nodes
(larger than the network diameter), µ ≥ 1/(number of nodes).
Theorem 26 µ ≤ 1, and this bound is a tight bound.
The proof of Theorem 26 is given in Appendix B.5.
Next, we analyze the maximum advantage a coding based scheme can offer with refer-
ence to a network’s bisection when the coding is also performed at the network’s bisection.
We proceed by defining Utilization Ratio with reference to a network’s bisection.
Definition 27 (Bisection Utilization Ratio (µ(bisection))) The Bisection Utilization
Ratio µ(bisection) is the ratio of the number of link-level packet transmissions crossing the
network's bisection when employing a coding based solution to the number of link-level packet
transmissions crossing the network's bisection when not employing a coding based solution.
Theorem 28 provides a lower bound on µ(bisection), i.e., an upper bound on the achievable
coding advantage, when coding is also performed at the network's bisection.

Theorem 28 µ(bisection) ≥ 1/2 when coding is also performed at the network's bisection,
and this bound is a tight bound.
The proof of Theorem 28 is given in Appendix B.6.
Theorem 29 Network Bisection is not an optimal location to place the middlebox for
performing coding.
The proof of Theorem 29 is given in Appendix B.7.
4.8 Theory to Practice
4.8.1 Coding Based Middlebox and its Components
For ease of explanation, as well as in accordance with our proof-of-concept prototype
implementation, we present the details of our architecture with reference to Hadoop
MapReduce. It can be easily extended to other distributed data center applications.
Moreover, the proposed coding scheme is independent of the underlying application.
Figure 4.8: The proposed coding based Hadoop MapReduce data flow.
We introduce three new stages, namely the sampler, the coder, and the preReducer, to the
traditional Hadoop MapReduce. The primary objective of the sampler is to gather side
information. Similarly, the primary objective of the coder is to code, and that of the
preReducer is to decode. The overall architecture is shown in Figure 4.8; while the figure
shows only two nodes, the architecture is in fact replicated across all the nodes.
In this section we focus on the middlebox, and its components (sampler, and coder).
We begin by introducing the notation.
We consider n nodes or physical machines, where a mapper j on physical machine i is
denoted by Mij, and a reducer j on physical machine i is denoted by Rij. Each mapper
Mij emits its 〈key, value〉 pairs to the partition, a memory buffer, Γi.
Figure 4.9: Architecture for the middlebox (sampler and coder).
Figure 4.10: Architecture for preReducer.
4.8.1.1 Sampler
The sampler gathers the side information. Specifically, the sampler resides in the middle-
box and fetches ℵ random records stored in the sorted and partitioned file spills from the
mappers at each of the physical machines. These random records are fetched in parallel
to the shuffle phase. This sampling process does not interfere with the shuffle process.
These sampled records not only help to start building the side information available at
each node but also play a pivotal role in the decision of which portions (segments) of the
packets should be coded (please refer to the Example in Section 4.4.2 regarding partial
coding).
The overall data flow from a mapper to the sampler is shown in Figure 4.9. Specifically,
a mapper first completes emitting the 〈key, value〉 pairs to the intermediate spills,
which are formed as a result of the circular memory buffer being filled up to a certain
threshold (io.sort.spill.percent). These intermediate spills are then merged to form a final
sorted and partitioned sequence file. In the final sequence file, each partition corresponds
to a particular reducer, and is made available to the reducer over HTTP. The sampler,
co-residing with the coder, fetches the random records, in parallel to the shuffle phase, from
the final sequence file, which is stored in SequenceFile format at the locations specified
by mapred.local.dir [41]. Records from the sequence file are read using the Java class
org.apache.hadoop.io.SequenceFile.Reader [114]. These sampled 〈key, value〉 pairs serve
as the initial Has set for each physical machine.
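The sampling step can be sketched as follows, with a mapper's final partitioned output mocked as an in-memory list (the actual implementation reads Hadoop SequenceFiles via org.apache.hadoop.io.SequenceFile.Reader); the function and variable names are our own:

```python
import random

# Hedged sketch of the sampler: fetch a small number of random records
# from a mapper's sorted, partitioned output to seed a node's Has set.

def sample_records(partition, aleph, seed=0):
    """Fetch aleph random <key, value> records from one mapper partition."""
    rng = random.Random(seed)   # fixed seed only for reproducibility here
    k = min(aleph, len(partition))
    return rng.sample(partition, k)

# Mock partition: the mapper's sorted <key, value> pairs for one reducer.
partition = [(f"key{i}", f"value{i}") for i in range(100)]

# The sampled pairs form the initial Has set for this physical machine.
has_set = set(sample_records(partition, aleph=8))
print(len(has_set))
```

Keeping aleph small bounds the sampling overhead, matching the point above that only a limited number of records is fetched rather than the complete file.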
Note that gathering the side information incurs an overhead, which is why the proposed
scheme suggests taking a limited number of samples instead of fetching the complete file.
This limited side information affects the coding advantage. However, for the sake of
practicality, it is important that the communication overhead associated with the collection
of side information be small. Note that later on, as the reducers fetch the data from the
respective mappers, the packets crossing the network middlebox gradually enhance the
side information for each of the contributing nodes without incurring additional network
overhead. The aggregated side information results in an improved coding advantage.
Spate coding needs access to the side information of each of the participating nodes
in order to decide on the combination of packets to be coded. More precisely, some knowledge
about the packets, not crossing the middlebox, from the mappers' outputs at each of the nodes
is required. For instance, the server can compute packet p8 ⊕ p11 for node1 and node2 in
the example given in Section 4.4 only if it knows that node1 has access to packet p1 and
node2 has access to packet p4.
4.8.1.2 Coder
The coder C is a dedicated software appliance, strategically placed in the middlebox,
for wire-speed packet payload processing (typically XORing packet payloads) and re-
injection of coded packets into the network. Specifically, the coder C initially receives ℵ
inputs from the partitions of all mappers via the sampler. After that it only receives the
information that passes through the part of the network where it is placed. The coder
performs the following three functions:
(1) Format Decision Making: Based on the sampled data records the coder de-
cides on a coding format consisting of byte indices ω1, ω2. The coder treats a 〈key, value〉
pair as a data chunk. In a data chunk the bytes starting from index ω1 and ending at
index ω2 are the ones that are anticipated for coding for the following ℵc generations; we
name these bytes the encodable chunk. A generation specifies a group of packets to be
processed together by the coder. The rest of the bytes in the data chunk, named the un-
codable chunk, are forwarded without coding. In other words, the coding format specifies
which portions of the collective 〈key, value〉 pairs to code. The logic behind choosing ω1,
ω2 is to find partitions of the data chunk that maximize coding advantage, as for a specific
Hadoop job some bytes of 〈key, value〉 pairs might contain more common information
than others. Note that coding exploits the common information between different file
splits residing on different physical machines. For the example in Section 4.4.2, if each
digit in the 〈key, value〉 pair is represented by a byte, then ω1 = 2 and ω2 = 28; the
uncodable chunk is just the first byte, and the rest of the packet is the encodable chunk.
The decision on the coding format is based on comparing the coding advantage over
the ℵ random records fetched by the sampler from different mappers. The coding advantage
is evaluated on different chunks of the data packet for different values of ω1, ω2.
The computational complexity of determining the best contiguous encodable chunk is
logarithmic (a very efficient procedure based on binary partition). Finding an arbitrary
number of noncontiguous encodable chunks can result in a higher coding advantage, but
also increases the computational complexity, requires more book-keeping, and leads to a
more complex packet header.
The coding format is periodically re-computed after every ℵc generations to fine-tune
the decision based on the particularities of the Hadoop job.
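A hedged sketch of the format decision is given below. The selection metric (byte positions on which all sampled records carry the same byte) is our illustrative stand-in for the coding-advantage comparison, not the thesis's actual procedure:

```python
# Illustrative sketch of choosing the coding format (omega1, omega2):
# pick the longest contiguous run of byte positions on which every
# sampled record agrees, as a proxy for the most codable region.

def best_contiguous_chunk(samples):
    """Return half-open (omega1, omega2) bounding the longest run of byte
    positions where all sampled records carry the same byte."""
    n = min(len(s) for s in samples)
    common = [len({s[i] for s in samples}) == 1 for i in range(n)]
    best = (0, 0)
    start = None
    for i, ok in enumerate(common + [False]):   # sentinel closes a final run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

# Mock sampled records from three mappers: only the first byte differs.
samples = [b"A-shared-payload", b"B-shared-payload", b"C-shared-payload"]
print(best_contiguous_chunk(samples))   # bytes 1..15 agree: (1, 16)
```

This mirrors the partial-coding example: the first byte stays in the uncodable chunk while the common tail becomes the encodable chunk.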
(2) Coding: This step performs binary coding (bitwise XOR) on the bytes
ω1 to ω2 (the encodable chunk) from each generation of received 〈key, value〉 pairs. The
coding algorithm is based on solving the spate coding problem in an efficient fashion, and is
presented in Section 4.5.3. We note that the algorithm requires information about the
Want set and the side information (Has set) of each node in the cluster participating in
the Hadoop job under consideration, as well as about which packets are equal. In our scheme,
the Want set is determined based on the key of the 〈key, value〉 pairs of the current generation,
whereas knowledge about the Has set keeps building at each generation by adding to the
side information set already obtained from the sampler. During each generation, the new
side information is extracted from the bytes ω1 to ω2 of the 〈key, value〉 pairs available
at the coder. For the coding purpose, we maintain the Has set of all the nodes without any
assumption on the extent of information overlap between different physical machines;
the information overlap between different physical machines can be arbitrary or none.
Moreover, the coding can result in multiple multicast packets to be forwarded to the
reducers. Each multicast packet is computed for a subset of reducers sharing a common
down-link path originating at the coder node. To enable processing at line rate, the
algorithm ensures that the coded packets are instantaneously decodable at the receiver
nodes, i.e., just based on the packets received and without any need to buffer a set of
incoming packets.
(3) Packaging: This step packs the outcome of the coding process into a custom
packet payload format, as shown in Figure 4.12, that is then re-injected into the network
towards the reducer nodes. Different fields in the packet payload format collectively
ensure that each reducer finally receives the 〈key, value〉 pairs intended for it. The fields
are:
• NUM ENCODED: It is a numerical value describing the number of packets that
are coded together. For instance, for the packet p2 ⊕ p5 ⊕ p8 ⊕ p11 of the Example in
Section 4.4.1, this value is 4. In the case where no encoding is performed, this field is
set to 0. This value is bounded by the number of reducers.
• CODING VECTOR: It is a vector of size NUM ENCODED, and contains
hashes of the encodable chunks. There can be a large set of encodable chunks, so
we use hashes for their fast matching with the Has set to identify side information
for the decoding process given in Section 4.8.2.
• RIDK: It is a vector of size NUM ENCODED, and specifies the intended reducers
along with their corresponding keys. We associate an ID (a bit vector) RID with the
reducer R, and RK denotes the key assigned to the reducer R. RIDK contains
hashes of RID RK pairs. There can be a large number of reducers in a Hadoop
cluster, and one coded transmission is multicast to only a subset of the reducers.
Moreover, a reducer might work on multiple keys, and does not know in advance
what 〈key, value〉 pair to expect in the received packet, so it is necessary that a
packet contain information about the intended reducers and their corresponding
keys. Hashes are used for fast identification of the reducer and the key.

Figure 4.11: Coding for variable packet lengths.
• CODED SEG: It is the actual coded chunk, formed by bitwise XOR of the
NUM ENCODED encodable chunks. For instance, for one of the coded packets of the
Example in Section 4.4.1, CODED SEG contains p2 ⊕ p5 ⊕ p8 ⊕ p11.
• UNCODED SEG: It contains NUM ENCODED uncodable chunks.
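The packaging step can be sketched as follows; the field widths, hash function, and helper names are illustrative assumptions rather than the actual wire format:

```python
import hashlib

# Hedged sketch of the Packaging step: pack NUM_ENCODED, CODING_VECTOR,
# RIDK, CODED_SEG, and UNCODED_SEG into one payload. A 4-byte truncated
# SHA-1 stands in for the (unspecified) hash; h = 4 here.

def h(data: bytes) -> bytes:
    """Short hash used for CODING_VECTOR / RIDK entries."""
    return hashlib.sha1(data).digest()[:4]

def package(encodable_chunks, uncodable_chunks, ridk_pairs):
    """Build the coded payload from equal-length encodable chunks."""
    num = len(encodable_chunks)
    coded = bytearray(len(encodable_chunks[0]))
    for chunk in encodable_chunks:                 # bitwise XOR over GF(2)
        coded = bytearray(a ^ b for a, b in zip(coded, chunk))
    payload = bytes([num])                                      # NUM_ENCODED
    payload += b"".join(h(c) for c in encodable_chunks)         # CODING_VECTOR
    payload += b"".join(h(rid + rk) for rid, rk in ridk_pairs)  # RIDK
    payload += bytes(coded)                                     # CODED_SEG
    payload += b"".join(uncodable_chunks)                       # UNCODED_SEG
    return payload

pkt = package([b"chunkAAA", b"chunkBBB"], [b"u1", b"u2"],
              [(b"R1", b"k1"), (b"R2", b"k2")])
print(len(pkt))
```

With two 8-byte encodable chunks, 4-byte hashes, and two 2-byte uncodable chunks, the payload is 1 + 8 + 8 + 8 + 4 = 29 bytes, illustrating how fewer packets in CODED SEG shrinks every per-packet field.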
Discussion
Note that there is no assumption that the packets (〈key, value〉 pairs) are of equal
length. To elaborate, consider two packets of different lengths, as shown in Figure
4.11, which highlights their encodable chunks. The coder performs XOR on the encodable
chunks while the uncodable chunks are concatenated for each packet. The resulting payload
for the coded packet is shown in Figure 4.11, elaborating the packaging of the encodable
and the uncodable chunks. The coded packet also contains the required header.
Furthermore, we argue that, in practical scenarios as demonstrated in Section 4.9, the
overall reduction in communication is much greater than the overhead required to support
the coding/decoding operations. To support our argument, we calculate the net savings,
i.e., the total reduction in volume of communication minus the total overhead, per coded
packet at the output link from the middlebox as follows:
( Σ_{i=1}^{τ} li − ( k + Σ_{i=1}^{τ} (li − k) ) ) − (2hτ + x)

where k = ω2 − ω1 bytes from each of the τ packets p1, p2, · · · , pτ, of lengths l1, l2, · · · , lτ
respectively, are coded together. The overhead comprises h bytes of hashes, used for each
entry in the CODING VECTOR and RIDK fields, and x bytes for the NUM ENCODED field.
For example, consider a scenario where a part of each of three 100-byte packets
is coded together. Specifically, 90 bytes from each packet are coded together,
with h = x = 1. The net savings per coded packet at the output link from the middlebox are
(300 − (90 + 3 · 10)) − (2 · 1 · 3 + 1) = 173 bytes; i.e., per coded packet, the gross savings are
180 bytes, the overhead is 7 bytes, and the net savings are 173 bytes. Note that the number of
bytes resulting from an XOR operation over k bytes from each of τ packets is still k. Furthermore,
the overall number of hops for an encoded or a traditional packet is the same.
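The net-savings expression can be checked numerically; the function name is ours, and the values follow the worked example above:

```python
# Numerical check of the net-savings expression: gross reduction in
# communication volume minus header overhead, per coded packet.

def net_savings(lengths, k, h, x):
    """Net savings per coded packet at the middlebox output link."""
    tau = len(lengths)
    # Gross savings: original bytes minus one coded chunk plus uncoded rests.
    gross = sum(lengths) - (k + sum(l - k for l in lengths))  # = (tau - 1) * k
    # Overhead: one h-byte hash each for CODING_VECTOR and RIDK per packet,
    # plus x bytes for NUM_ENCODED.
    overhead = 2 * h * tau + x
    return gross - overhead

print(net_savings([100, 100, 100], k=90, h=1, x=1))   # -> 173
```

The expression simplifies to (τ − 1)k − (2hτ + x), so savings grow with both the number of packets coded together and the size of the encodable chunk.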
4.8.2 PreReducer
The major role of the preReducer component is to ingest the custom-made packets sent
by the coder, decode their payloads and extract the 〈key, value〉 pairs which are to be
input to the standard Hadoop Sorter.
We recall that a mapper stores its output 〈key, value〉 pairs in a set of partitions,
where each partition contains the 〈key, value〉 pairs for a particular reducer. We further note
that, based on the placement of the middlebox or some routing intricacies, some of the
〈key, value〉 pairs fetched by a reducer might not pass through the middlebox. Hence,
Figure 4.12: The proposed packet payload format for the coding based shuffle at the application layer. Depending on size, the format shown may be transported using multiple segments/packets.
the preReducer should tackle both types of packets, namely packets coming through
the middlebox with the additional header, and packets not coming through the middlebox.
Specifically, the preReducer passes packets not coming through the middlebox to the
reducer without any changes. The preReducer performs the decoding and 〈key, value〉
extraction process on packets coming through the middlebox. The complete architecture
for the preReducer is shown in Figure 4.10.
The preReducer decodes the coded part of the packet based on the coding format,
and forms 〈key, value〉 pairs by inserting the decoded chunk from byte index ω1 to ω2
into the corresponding uncoded chunk of the packet. Once a node receives a packet,
the preReducer first checks the RIDK field to determine the intended reducer and the
corresponding key. Then it identifies the side information required to decode its intended
encodable chunk by comparing the hashes of the mappers' outputs stored locally with
the hashes contained in CODING VECTOR. The decoding is performed by XORing
CODED SEG with the side information. The detailed decoding process is given in
Algorithm 2.
Algorithm 2 PreReducer
 1: n = NUM_ENCODED
 2: if n ≠ 0 then
 3:   Retrieve the index myIndex from the packet by sifting through the RIDK vector
 4:   decoded_Seg = CODED_SEG
 5:   for i = 1 to n do
 6:     if i ≠ myIndex then
 7:       decoded_Seg = decoded_Seg ⊕ lookup(Hash_i)
 8:     end if
 9:   end for
10:   Retrieve 〈key, value〉 by inserting decoded_Seg at index ω1 + 1 in the UNCODED_SEG at index myIndex
11: else
12:   〈key, value〉 is the same as the packet received
13: end if
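The decoding steps of Algorithm 2 can be sketched as follows. This is an illustrative sketch, not the thesis implementation: the packet is modeled as a plain dictionary whose field names mirror the format of Figure 4.12, and `local_store` (mapping a hash to the locally cached mapper-output segment) stands in for the `lookup` routine of Algorithm 2.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def pre_reduce(packet: dict, local_store: dict) -> bytes:
    """Sketch of Algorithm 2: recover this reducer's chunk from a coded packet.

    local_store maps a hash to the locally cached mapper-output segment
    (the side information gathered during the map phase)."""
    n = packet["NUM_ENCODED"]
    if n == 0:
        # Packet did not pass through the middlebox: forward unchanged.
        return packet["UNCODED_SEG"][0]

    # Step 3: find this reducer's position by sifting through the RIDK vector.
    my_index = packet["RIDK"].index(packet["my_reducer_id"])
    decoded = packet["CODED_SEG"]
    for i in range(n):
        if i != my_index:
            # Steps 5-9: XOR out every other reducer's segment, located via
            # the hash carried in CODING_VECTOR.
            decoded = xor_bytes(decoded, local_store[packet["CODING_VECTOR"][i]])

    # Step 10: splice the decoded chunk into this reducer's uncoded chunk at
    # byte index omega_1 to rebuild the <key, value> pair.
    uncoded = packet["UNCODED_SEG"][my_index]
    w1 = packet["omega_1"]
    return uncoded[:w1] + decoded + uncoded[w1:]
```

The XOR of the coded segment with all other reducers' side information leaves exactly the segment intended for this reducer, which is then re-inserted into the uncoded chunk.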
4.8.3 Seamless Integration using OpenFlow
For a seamless integration of the proposed scheme into the Hadoop architecture, we
propose a novel software-defined networking (specifically OpenFlow) based scheme for
multicasting coded packets to the right reducers. Unlike conventional receiver-initiated
multicast, the use of OpenFlow (version 1.2) provides the ability to have the middlebox
control the dynamics of multicast groups on-demand. Hadoop contains information about
the placement of mappers and reducers in the cluster, and the topology they are connected
through [115]. Using this information mappers and reducers can be grouped into subtrees.
The proposed scheme implements multicast state across a multicast group G—at each
switch along the tree topology—using a Group Table entry with Group ID=G. The Group
Type is set to all (multicast), and respective Action Bucket is set to forward the packets
to the set of destination ports corresponding to the multicast tree links. We overload
the semantics of two IP header fields here, namely the ECN and DS fields (8 bits long
in total), to match multicast group state on a switch with packets of a specific multicast
group, allowing us to use 256 multicast groups in total. An alternative is to use locally
administered multicast MAC addresses, which can allow handling of up to 2^46 multicast
groups.
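As an illustration of the header overload described above, the following sketch (with hypothetical helper names, not part of the implementation) packs a group ID into the 8-bit DS+ECN byte and resolves it against a Group Table keyed by that byte; the port sets reproduce the two multicast trees of the example in Figure 4.13.

```python
def tag_group(group_id: int) -> int:
    """Value written into the overloaded 8-bit DS+ECN byte of the IP header."""
    if not 0 <= group_id < 256:          # 8 bits -> at most 256 groups
        raise ValueError("group ID must fit in DS+ECN (8 bits)")
    return group_id

def split_ds_ecn(tos: int):
    """Split the ToS byte back into DS (upper 6 bits) and ECN (lower 2 bits)."""
    return tos >> 2, tos & 0b11

def match_group(tos: int, group_table: dict) -> list:
    """Return the Action Bucket (set of output ports) for the packet's group."""
    return group_table[tos]

# Group Table for the example of Figure 4.13: two multicast trees.
group_table = {1: ["a", "b", "x"],   # ports a, b of S1 and port x of AS
               2: ["c", "d", "y"]}   # ports c, d of S2 and port y of AS
```

The 256-group limit falls directly out of the 8-bit DS+ECN budget checked in `tag_group`.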
The following two practical constraints are associated with the implementation of
multicast in large-scale data centers. First, keeping a multicast group for every coding
combination, determined by the coder, may be prohibitively expensive. Second, creating
multicast groups on-demand can incur high latency. To meet these two practical
constraints, we introduce a component called the coding groups coordinator (CGC) within
the coder. The CGC coordinates, in tandem with the OpenFlow controller, the dynamic
creation of multicast groups using a small (constant) working set of multicast groups.
Specifically, the CGC maintains at most α (a small constant) group-queues. Each group-
queue is associated with a distinct set of reducers (a group of receivers) and is identified
by a unique Queue-ID. In addition to this, the CGC maintains one general queue that
is drained to the network. Packets from the coder destined to the same multicast group
are buffered in the same group-queue, whereby upon adding the first packet to
a group-queue, the CGC notifies the OpenFlow controller to create network state for the
multicast group that the packet corresponds to. Once a queue is fully flushed, the CGC
picks the next multicast group from the general queue and assigns the emptied queue
to a new multicast group, while in parallel the next group-queue containing the packets
corresponding to another multicast group is drained to the network (i.e., one group-queue
is drained at every instance of time). This ensures a temporal overlap of multicast state
creation for a future multicast group with injection of packets to the network, while also
operating on a fixed budget of multicast state (#groups). In short, our approach somewhat
resembles the concept of virtual queues in network routers, though without the
starvation constraint, since in this case the workload does not impose deadlines on
individual virtual queues, Hadoop being throughput-bound rather than latency-bound.
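The CGC's queue discipline can be sketched as follows. This is a simplified model under stated assumptions: `notify_controller` stands in for the call that asks the OpenFlow controller to install multicast state for a group, and draining is modeled as flushing one whole group-queue at a time.

```python
from collections import deque

class CGC:
    """Sketch of the coding groups coordinator: at most alpha live group-queues
    plus one general queue holding packets for groups without a live queue."""

    def __init__(self, alpha, notify_controller):
        self.alpha = alpha            # fixed budget of multicast state (#groups)
        self.queues = {}              # group -> deque of buffered packets
        self.general = deque()        # (group, packet) pairs awaiting a slot
        self.notify = notify_controller

    def enqueue(self, group, packet):
        if group in self.queues:
            self.queues[group].append(packet)
        elif len(self.queues) < self.alpha:
            # First packet of this group: ask the controller to create network
            # state now, overlapping state setup with draining of other queues.
            self.queues[group] = deque([packet])
            self.notify(group)
        else:
            self.general.append((group, packet))

    def drain(self, group):
        """Flush one group-queue; the freed slot is claimed by the next group
        waiting in the general queue."""
        sent = list(self.queues.pop(group, []))
        backlog, self.general = self.general, deque()
        for g, p in backlog:
            self.enqueue(g, p)        # next waiting group claims the free slot
        return sent
```

In this model the controller notification always precedes the group-queue's drain, which captures the temporal overlap of state creation with packet injection described above.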
Figure 4.13(b) describes the process of multicasting four encoded packets P2 ⊕ P5,
P3⊕P6, P8⊕P11, and P9⊕P12, found as a result of execution of Algorithm 1 applied on
the example in Section 4.4.1. Figure 4.13(a) shows the entries for Flow Tables and the
corresponding Group Tables for each switch in the multicast tree. Note that the encoded
packets P8 ⊕ P11 and P9 ⊕ P12 are to be multicasted to the multicast group consisting of
node1 (hosting reducer1) and node2 (hosting reducer2). Similarly, the encoded packets
P2 ⊕ P5 and P3 ⊕ P6 are to be multicasted to the multicast group consisting of node3
(hosting reducer3) and node4 (hosting reducer4). In particular, the CGC creates two
queues, the queue with Queue-ID=1 for the multicast group consisting of node1 and
node2 (destination of P8 ⊕ P11 and P9 ⊕ P12), and the queue with Queue-ID=2 for the
multicast group consisting of node3 and node4 (destination of P2⊕P5 and P3⊕P6). The
CGC then creates the network state for these multicast groups. For all the packets in
the queue with Queue-ID=1 the DS+ECN field is set equal to 1, and for all the packets
in the queue with Queue-ID=2 the DS+ECN field is set equal to 2. The Flow Table
entries and corresponding match-actions inspect the DS+ECN field of a packet,
then utilize the Group Table to find the corresponding multicast group, and forward
based on the Action Bucket. For example, the packets with DS+ECN field equal to 1 are
forwarded according to the Action Bucket under Group ID 110, i.e., multicasted to the
ports a and b of the switch S1, and port x of the switch AS. Similarly, the packets with
DS+ECN field equal to 2 are forwarded according to the Action Bucket under Group ID
100, i.e., multicasted to the ports c and d of the switch S2, and port y of the switch AS.
4.9 Performance Evaluation
We developed a prototype as well as a testbed to evaluate the performance of the proposed
coding based approach. Section 4.9.1 describes the prototype and Section 4.9.2 describes
the testbed. We want to emphasize that the prototype and the testbed represent two of
the most commonly used real-world system development models, i.e., proprietary versus
open-source. The prototype was implemented in a data center using costly proprietary
tools and hardware. The testbed, on the other hand, was implemented using open-source
tools, virtualized environments, and commodity off-the-shelf components. The
Figure 4.13: OpenFlow based multicasting using the coding groups coordinator for the example given in Section 4.4.1. The Flow Tables and the corresponding Group Tables for each switch in the multicast tree are also shown.
experimental results showed the advantage of our proposed scheme in both models. We
use data from Hadoop shuffle to benchmark our proposed solution. The Hadoop jobs
consisted of the following two industry standard benchmarks used widely for performance
evaluation of big data frameworks. These benchmarks represent a wide spectrum of big
data workloads. Specifically, the benchmarks are:
1. Terasort, a benchmark to sort 10 billion records (1 terabyte (TB)). These records
are generated using a program called TeraGen [116]. Each record consists of 100
bytes, with the first 10 bytes being the key and the remaining 90 bytes being the
value. This benchmark represents workloads ranging from mathematical applications
to applications using artificial intelligence.
2. Grep (Global Regular Expression) represents generic pattern matching and search-
ing on a data set. Our data set consisted of an organization's data-logs, whereby
the goal was to calculate the frequency of occurrence of eight different types of
events in the data-log input. Applications of Grep vary from data mining and
sequencing to anomaly detection.
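The Terasort record layout above can be expressed directly; the helper name below is ours, for illustration, and not part of the benchmark:

```python
def split_record(record: bytes):
    """Split a Terasort record into its (key, value) parts: each record is
    exactly 100 bytes, a 10-byte key followed by a 90-byte value."""
    if len(record) != 100:
        raise ValueError("Terasort records are exactly 100 bytes")
    return record[:10], record[10:]
```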
We want to point out that these two benchmarks not only cover a wide range of big
data applications, but also cover a wide spectrum of the network traffic generated by big
data applications. In fact, Cisco has used the same two types of benchmarks to study
big data infrastructure considerations, and performance of their proposed solutions. Fur-
thermore, these two benchmarks capture the workloads with two very different traffic
patterns, namely “Business Intelligence (BI)” workload benchmark captured by Grep,
and “Extract, Transform, and Load (ETL)” workload captured by Terasort [117]. More-
over, these benchmarks are two of the most widely used standard practice benchmarks for
assessing performance of the Hadoop MapReduce implementations [42,68,79,118–120].
In addition to vanilla Hadoop, we have used the in-network combiner as explained in
Section 2.3.1—a network-level optimization focused on aggregating mappers' outputs for
reducing network traffic—as another baseline to compare the performance of the proposed
scheme. We want to point out that the in-network combiner is the current state of the
art, outperforming the conventional combiner as shown in Section 2.3.1. Below we present a
word count example demonstrating the in-network combiner and the proposed coding based
scheme with the objective of reducing network traffic.
Consider an eight node Hadoop cluster consisting of eight mappers and eight reducers.
The output from each of the mappers, as well as the key that each of the reducers is
responsible for, is shown in Figure 4.14. For an even comparison of the benefit, the in-network
combiner as well as the coder are located at the aggregate switch AS. As shown in the
figure, the in-network combiner aggregates the packets 〈MAT, 1〉 originating from node1
and node2 into a new packet 〈MAT, 2〉, and 〈FAT, 1〉 from node3 and node4 to create
a new packet 〈FAT, 2〉. This results in 8% less traffic crossing the network bisection.
On the other hand, employing spate coding results in 33% less traffic by generating a
set of four packets that cross the network bisection, as shown in Figure 4.14. We want
to point out that the above discussion pertains to the packets that cross the network
bisection where both the coder and the combiner are placed. Specifically, the packets
that are exchanged via s1/s2 can neither be combined nor coded. For example, the
packets 〈MAT, 1〉 are to be transferred from nodes node1, node2, node5, node7, and
node8 to node6; however, only the packets originating at nodes node1 and node2
cross the bisection. A similar discussion holds for the packet 〈FAT, 1〉.
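The combiner side of the example can be sketched as follows (an illustrative model, not the in-network implementation): 〈key, 1〉 packets arriving at the aggregation point for the same key are merged into a single 〈key, n〉 packet.

```python
from collections import Counter

def combine(packets):
    """Aggregate (key, count) packets seen at the combiner into one
    (key, total) packet per key, as in the word count example."""
    merged = Counter()
    for key, count in packets:
        merged[key] += count
    return sorted(merged.items())
```

In the example above, the two 〈MAT, 1〉 packets and the two 〈FAT, 1〉 packets merge into 〈MAT, 2〉 and 〈FAT, 2〉, halving those four bisection-crossing packets to two.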
4.9.1 Prototype
We have prototyped the sampler, the coder, and the preReducer (decoder) in a data center
as an initial proof-of-concept implementation.
Our testbed consisted of the following types of racks:
• Compute Racks
Figure 4.14: A word count example.
• Management and Storage Racks
Compute Racks hosted the mappers, reducers, and the middlebox. Each compute rack
consisted of blade-servers. Each server was equipped with twelve x86_64 Intel Xeon
cores, 128 GB of memory, and a single 1 TB hard disk drive. Each processor had 12
MB of Smartcache with the maximum memory bandwidth of 32 GB/s. The servers
used in prototyping were arranged in three racks. All the servers were running Red Hat
Enterprise Linux 6.2 operating system [121]. The components were implemented using
Java, and Octave [122]. Management Rack provided remote access to the compute racks.
Storage Rack was used for storing the home directory of the users.
The interconnect consisted of OpenFlow enabled IBM RackSwitch G8264 switches as Top-of-
Rack switches, and OpenFlow enabled Pronto 3780 switches as Aggregation switches. The
RackSwitch G8264 provided non-blocking line-rate switching with a throughput support of
1.28 Tbps. The Pronto 3780 provided a bandwidth of 960 Gbps and supported up to 16K route
rules.
Figure 4.15 depicts the prototype setup.
We use the following metrics for quantitative evaluation:
• Job Gain, defined as the increase (in %) in the number of parallel Hadoop jobs
that can be run simultaneously with coding based shuffle compared to the number
of parallel Hadoop jobs achieved by standard Hadoop
• Utilization Ratio, defined as the ratio of link-level packet transmissions when em-
ploying coding based shuffle to the number of link-level packet transmissions incurred
by the standard Hadoop implementation.
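The two metrics above are simple ratios; the following is a direct transcription of their definitions (the example values in the usage note reproduce the Sorting row of Table 4.1, for illustration only):

```python
def job_gain(jobs_coded: int, jobs_vanilla: int) -> float:
    """Increase (in %) in parallel Hadoop jobs under coding based shuffle
    relative to standard Hadoop."""
    return 100.0 * (jobs_coded - jobs_vanilla) / jobs_vanilla

def utilization_ratio(tx_coded: int, tx_vanilla: int) -> float:
    """Ratio of link-level packet transmissions, coded shuffle vs standard
    Hadoop (lower is better)."""
    return tx_coded / tx_vanilla
```

For instance, 129 parallel jobs against a baseline of 100 gives a Job Gain of 29%, and 71 transmissions against 100 gives a Utilization Ratio of 0.71.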
Our experimental study shows that for both of the tested benchmarks, the overhead to
implement coding based shuffle (in terms of transmission of extra bookkeeping data in
packet headers) was less than 4% in all the experiments. Table 4.1 shows the results
Figure 4.15: Prototype setup.
Hadoop Job    Job Gain    Utilization Ratio
Sorting       29%         0.71
Grep          31%         0.69

Table 4.1: Job Gain and Utilization Ratio using the proposed coding based shuffle.
across the two metrics for the two benchmarks. The results show a significant performance
advantage of our scheme over the standard Hadoop implementation.
Noting that our coding based scheme requires only XORing of packets, which
is a computationally very fast operation, and given the high memory bandwidth of the
servers, we were able to process close to line rate. Specifically, even in the worst case
scenario, the throughput of the coder was 809 Mbps on a 1 Gbps link.
4.9.2 Testbed
Our testbed consisted of eight virtual machines (VMs), each running CentOS 7 as the
operating system [123], on top of x86_64 Intel i7 cores. CentOS is a free Linux
distribution aimed at providing enterprise-class functionality compatible with Red Hat
Enterprise Linux [121]. We used Citrix XenServer 6.5 as the underlying hypervisor [124].
In addition to more than 1 TB of local hard disk and solid state drives, each VM had
access to 3 TB of network attached storage. Citrix XenCenter was used to manage the
XenServer environment and deploy, manage and monitor VMs and remote storage [125].
Open vSwitch [126] was used as the underlying switch providing network connectivity to
the VMs. Open vSwitch is a production quality, distributed, multilayer virtual switch
designed to enable massive network automation supporting OpenFlow as well as NIC-
teaming. The rest of the software implementation was the same as used in Section 4.9.1.
Each virtual rack contained VMs, and one of the VMs was used as a virtual appliance
for the middlebox, as shown in Figure 4.16.
Moreover, we have implemented a stand-alone split-shuffle, to provide better insights
Figure 4.16: Testbed architecture.
into shuffle dynamics, where receiver service instances (e.g., reducers) fetch file spills from
sender service instances (e.g., mappers) in a split fashion using the standard Hadoop HTTP
mechanism.
To investigate the performance of the proposed scheme as well as middlebox placement
in different scenarios, we used the following two commonly-used data center topologies:
1. Tree topology (shown in Figure 4.17(a)) with the middlebox at the bisection. We
denote this topology by Top-1.
2. Tree topology with NIC-Teaming (shown in Figure 4.17(b)). Moreover, in this
topology the middlebox is placed at the first L2-switch. We denote this topology by
Top-2.
We measured the following parameters of interest:
• Volume-of-Communication (VoC), defined as the amount of data crossing the net-
work bisection. Note that VoC and utilization ratio, although related to each other,
Figure 4.17: Topologies used for testbed: (a) Top-1 (b) Top-2.
measure two different quantities. Note that VoC consists of the true payload, i.e., the
total size of 〈key, value〉 pairs, as well as the metadata (e.g., packet header including
hashes, coding vector, etc.) needed for the proposed solution, thus proportionately
reducing the coding advantage. Furthermore, VoC provides a fair comparison ground
irrespective of the underlying system. Note that VoC is explicitly chosen to be different
from the metric used in the theoretical analysis, i.e., the number of packet transmissions.
The primary reason for the selection of VoC is that it captures network related parameters
like the underlying network protocol, packet size, etc., as well as the packet overhead
used for the coding based scheme. On the other hand, a system dependent parameter
like VoC is not suitable for deriving theoretical results related to complexity and
bounds that are general enough to be applicable to practical setups.
• Goodput (GP), defined as the number of useful information bits delivered to the
receiver service instance per unit of time.
• %disk-utilization (%d-util), defined as the percentage of CPU time during which
I/O requests were issued to the storage disk. Disk saturation occurs when this
value is close to 100%. Higher %d-util means poor performance (disk bottleneck).
• Queue-size (Qsz), defined as average queue length of the requests that were issued
to the storage disk. Higher Qsz means longer wait time.
• Bits-per-Joule (BpJ), defined as the number of bits that can be transmitted per
Joule of energy.
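Two of these metrics reduce to direct formulas, transcribed below (the disk-side metrics, %d-util and Qsz, are instead read from standard disk statistics tools):

```python
def goodput_bps(useful_bits: float, seconds: float) -> float:
    """GP: useful information bits delivered to the receiver service
    instance per unit of time."""
    return useful_bits / seconds

def bits_per_joule(bits: float, joules: float) -> float:
    """BpJ: number of bits transmitted per Joule of energy."""
    return bits / joules
```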
The selected performance parameters are closely connected and comprehensively
express the benefits that the proposed scheme has to offer in a holistic fashion, capturing
both the network and the nodes. Specifically, the volume of communication and the number
of bits transmitted per Joule of energy focus on the network aspects, whereas goodput,
disk utilization, and queue size capture a node's perspective.
VoC measures the overall data transfer (network load) during the shuffle phase. On the
other hand, goodput measures the average effective throughput, crucial from an applica-
tion's perspective. Furthermore, aside from the amount of data and the data rate, the energy
required to shuffle data is an important parameter indicating the level of greener opti-
mization offered by the proposed scheme. Similarly important is measuring device
utilization and queue size, capturing the advantage of the proposed scheme in terms of
the availability of system resources to different processes and virtual machines.
4.9.2.1 Volume-of-Communication
In this section we compare the VoC of our proposed approach with vanilla Hadoop and a
state-of-the-art in-network combiner. An in-network combiner reduces the VoC by par-
tially distributing the functionality of the receiver service instances over the network [42,44].
Additionally, to demonstrate the complementary nature of our approach, we deployed
the proposed coding based approach on top of in-network combiner (Combine-N-Code).
Figure 4.18: Normalized VoC using the Grep benchmark for both topologies Top-1 and Top-2.
Figure 4.19: Normalized VoC using the Terasort benchmark for both topologies Top-1 and Top-2.
Figure 4.18 shows the normalized VoC for all four scenarios using the Grep benchmark
for both topologies Top-1 and Top-2. Note that the normalized VoC of the proposed
coding based approach is the same as µ(bisection) defined in Section 4.7. The results
show that the proposed coding based approach, compared to vanilla shuffle, can reduce
the volume of communication by 43% for Top-1, and 59% for Top-2. Moreover, the coding
based approach outperforms in-network combiner both for Top-1 and Top-2 by 13% and
38% respectively. Furthermore, deploying Combine-N-Code can reduce the volume of
communication by a staggering 64% for Top-1, and 79% for Top-2. Figure 4.18 also
shows the 95% confidence interval for vanilla shuffle, in-network combiner, proposed
coding based scheme, as well as Combine-N-Code.
Figure 4.19 shows the normalized VoC for all four scenarios using the Sorting bench-
mark for both topologies Top-1 and Top-2. For the sorting benchmark the in-network
combiner did not reduce the volume of communication at all, whereas the proposed coding
based scheme leads to a 37% reduction in volume of communication for Top-1, and a 62%
reduction in volume of communication for Top-2, compared to both vanilla shuffle as well
as in-network combiner. Figure 4.19 also shows the 95% confidence interval for vanilla
shuffle, in-network combiner, proposed coding based scheme, as well as Combine-N-Code.
It is interesting to note that the coding based approach reduces more network traffic
on Top-2 as compared to Top-1. The better performance for Top-2 is due
to the specific placement of the middlebox, resulting in more data being exchanged through
it, and hence giving rise to more coding opportunities. Specifically, for Top-1, 56% and
67% of the data exchange happened through the L2-Aggregate switch (co-located with
the middlebox) for Grep and Sorting respectively, whereas for Top-2 most of the data
exchanged passed through the L2-switch (co-located with the middlebox), giving rise
to more coding opportunities.
The communication overhead associated with side information gathering by the sam-
pler for Grep and sorting was less than 1% (0.65% and 0.4%, respectively).
Figure 4.20: Normalized VoC for different values of λ (maximum clique size) used in Algorithm 1.
Figure 4.20 shows the normalized VoC for different values of λ (maximum clique size)
used in Algorithm 1. Increasing λ increases the coding advantage (decreases VoC). The
maximum coding advantage is achieved at λ = 8, which conforms to the discussion in
Section 4.5.3.
4.9.2.2 Goodput
In this section we compare the proposed approach with vanilla shuffle for GP measured at
the receiver service instance. Figure 4.21 shows that the coding based scheme outperforms
vanilla shuffle at all link rate settings. Moreover, the coding benefit increases with the
increase in link rates (GP for the coding based scheme is 55% higher than vanilla shuffle at
500 Mbps and grows to 76% at 1000 Mbps). Figure 4.21 also shows the 95% confidence
interval for both vanilla shuffle and proposed coding based scheme.
Moreover, we investigate the impact of oversubscription ratio on GP for different
link rates. The oversubscription ratios were implemented by generating constant bit
Figure 4.21: Goodput versus link rates for sorting benchmark for topology Top-1.
rate background TCP traffic using the iperf tool. Figure 4.22 shows the average GP for
oversubscription ratios of 1, 5, 10, and 15, where the link rate was fixed at 500 Mbps.
Figure 4.23 shows a similar plot for link rate of 1000 Mbps. The coding benefit, averaged
over oversubscription ratios, is around 56% for 500 Mbps, and 58% for 1000 Mbps.
4.9.2.3 Disk IO Statistics
In this section we compare the proposed approach with vanilla shuffle for two disk I/O related
parameters, i.e., %d-util and Qsz, measured at the receiver service instance. Figures 4.24
and 4.25 show that the coding based scheme outperforms vanilla shuffle at different link
rate settings. Furthermore, for both Qsz and %d-util, the percentage improvement in
performance between the proposed scheme and vanilla shuffle peaks at more than 39%
at a link rate of 1000 Mbps. This trend can be explained by observing that as the link
rate becomes higher, the disk I/O at the receiver service instances becomes the bottleneck,
which can be compensated by the reduction in the volume of communication offered by the
Figure 4.22: Goodput for different oversubscription ratios using sorting benchmark fortopology Top-1 with link rate at 500 Mbps.
coding based scheme. Figures 4.24 and 4.25 also show the 95% confidence interval for
both vanilla shuffle and proposed coding based scheme.
%d-util and Qsz are recorded for the same duration for both vanilla shuffle as well as
coding based scheme.
4.9.2.4 Bits-per-Joule
Energy proportionality—energy consumption proportional to the load—of a network de-
vice is a desirable feature for a greener big data enterprise. It has been observed that
energy consumption is proportional to the load on network devices; however, in practice
this proportional relationship is not ideal, rather devices consume energy even at
no load (e.g., Figure 2 in [127]). Moreover, the power consumption depends on the link
rate [58, 127,128].
Figure 4.23: Goodput for different oversubscription ratios using sorting benchmark fortopology Top-1 with link rate at 1000 Mbps.
Figure 4.24: %d-util versus link rates using sorting benchmark for topology Top-1.
Figure 4.25: Qsz versus link rates using sorting benchmark for topology Top-1.
Adaptive Link Rate (ALR) [128, 129] and traffic consolidation are two of the pop-
ular techniques, exploiting energy proportionality, to optimize energy consumption in
a network device. In ALR, network devices tune their network interfaces to different
rates and/or shut down some of them [128, 129]. In traffic consolidation based schemes,
traffic from multiple links is consolidated to a smaller number of links while turning
off the unused interfaces [30, 51]. Our proposed scheme can lead to energy savings by
deploying traffic consolidation and ALR—coding can reduce volume of communication
consequently reducing traffic on the interfaces and creating opportunities for turning off
some interfaces and/or tuning them to lower rates.
In this section we compare BpJ for the proposed approach with vanilla shuffle. We
compute BpJ for a fixed link rate, with the links fully utilized. More precisely,
we investigated the possible improvement in energy efficiency by changing Open vSwitch link
rates in the testbed and using the power ratings given in [128] to calculate BpJ. Specifically,
a k% reduction in volume of communication using coding based scheme can correspond
to turning off that port for up to k% of the time as compared to vanilla shuffle. Note that
coding based shuffle has a smaller amount of data to transfer compared to vanilla shuffle, as
Figure 4.26: BpJ versus link rates using sorting benchmark for topology Top-1.
shown in our results in Section 4.9.2.1.
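The duty-cycle reasoning above can be made concrete with a small model, under the stated assumptions: the port draws its rated active power while transmitting and approximately zero when turned off, and a k% reduction in transmitted volume shortens the on-time by the same fraction. `bpj_estimate` is our illustrative helper, not the exact calculation performed with the ratings of [128]:

```python
def bpj_estimate(useful_bits: float, transmitted_bits: float,
                 link_rate_bps: float, active_watts: float) -> float:
    """Useful bits delivered per Joule under a simple port duty-cycle model."""
    on_time_s = transmitted_bits / link_rate_bps  # port active only while sending
    energy_j = active_watts * on_time_s           # ~0 W assumed while turned off
    return useful_bits / energy_j
```

Under this model, a 37% reduction in volume of communication (as measured for the sorting benchmark on Top-1) would yield roughly a 1/(1 − 0.37) ≈ 1.59× BpJ improvement at a fixed link rate and active power.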
Figure 4.26 shows BpJ for different link rates. The improvement in BpJ of coding
based scheme over vanilla shuffle grows significantly with higher link rates (i.e., 2.5, 4.5,
and 187.2 more BpJ at link rates of 10, 100, and 1000 Mbps respectively).
In general, by choosing lower link rates, favouring lower power (energy per unit time)
consumption, the corresponding GP also drops, as much more time is spent in the network.
We observe that choosing a lower link rate to improve power efficiency might
not be an energy efficient solution in certain scenarios, as deduced from the results of Figure
4.26, where BpJ grows with the increase in link rate.
4.10 Summary and Discussion
We introduce the novel concept of spate coding for reducing the volume of commu-
nication, without degrading the rate of information exchange, in big data processing
frameworks. We have introduced, for the first time to our knowledge, a network middle-
box service that employs spate coding to optimize network traffic. We have not only
analyzed the computational complexity of general schemes for minimizing the volume
of communication, but also provided bounds on the maximum advantage and on placement
strategies. We have also performed a proof-of-concept implementation in real world prac-
tical scenarios. Moreover, we performed several experiments to investigate the proposed
scheme in a holistic fashion, including parameters capturing both the network and the
host nodes. The performance indicators, including volume of communication, goodput,
bits per Joule, disk utilization, and queue size, showed the performance of the proposed
system across a spectrum of scenarios and applications. The improved goodput is
indicative of a greener framework plane, whereas improved disk utilization and queue
size reflect a greener server plane and virtualization plane. Similarly, the decreased volume
of communication corresponds to a greener interconnect plane.
Although this chapter discusses the proposed scheme in the context of Hadoop MapRe-
duce, the scheme can be extended to multistage multitask frameworks
by treating each group of tasks communicating with each other individually, or by assigning
a virtualized appliance to each subgroup of associated tasks. Furthermore, in multi-
stage scenarios where the next stage of a job starts after completion of the first stage,
the output of the first stage can serve as a superior set of side information for the next
stage, and can result in better performance as well as less reliance on the sampler. Ex-
tending the current scheme and model to more futuristic scenarios that are not related
to MapReduce is an interesting avenue for future research. However, the three proposed
components, the sampler, coder, and preReducer, are scalable for Hadoop MapReduce with
a large number of mapper/reducer tasks. In particular, the preReducer is a component that
is individual to each physical node—hosted locally and operating independently—and
therefore the number of tasks/nodes does not pose a scalability challenge. The sampler
and the coder components, residing in the network middlebox, are limited in number,
and dealing with a large number of tasks efficiently might need some planning. One of
the possibilities is to decompose the spate coding problem for all tasks, in a divide-and-
conquer fashion, into a number of sub-problems, each dealing with a
practically sized group of tasks. Each subproblem can either be assigned to an individual
middlebox appliance or to a virtualized appliance. The scalability also highly depends on
the hardware used for the middlebox. For instance, an FPGA-based middlebox might be
able to process even a very large number of tasks, eliminating the need for subdividing
the problem. On the other hand, using one of the nodes as a middlebox device might need
sophisticated handling, like subdividing and/or scheduling the subproblems, to keep up
with the enormous workload.
4.11 Connecting the Dots
We want to reemphasize that turning the big data enterprise into a greener enterprise
requires two fundamental steps. The first step is curtailing resource consumption, and
making data centers energy efficient. The second step is using greener energy sources
to power data centers. This chapter provided a coding based scheme focused
on making data centers greener and more energy efficient. The next chapter deals with
greener power procurement for the workhorse of the big data enterprise, i.e., data centers.
Chapter 5
Distributed Power Procurement for
Data Centers
As discussed in Chapter 1, a greener big data enterprise requires both the processing
work-horses as well as the power procurement to be green. Data centers consume around
3% of the world’s total electricity, whereas renewable energy sources account for no more
than 10% of the generation mix. However, with the advent of smart grid, there is a
paradigm shift in the power procurement, e.g., it is expected that around 50% of all the
power procured for the data centers would be from greener sources [4]. A major feature
of this new paradigm is association of the electricity pricing with the geographical and
temporal diversity of the sources. Uncertainties associated with the electricity markets
and the integration of greener sources lead many data center operators to buy energy
in bulk at locked-in prices for extended lengths of time, from "ungreen" sources, which
in certain scenarios cost more. Therefore, there is a growing energy-procurement trend
which enables data centers to realize a better and greener energy mix while lowering energy
costs [131,132].
Moreover, data center operators tend to house multiple facilities across geographical
Part of this work appears in [130].
Chapter 5. Distributed Power Procurement for Data Centers 92
locations and workload redistribution across different facilities is a common practice by
data centers. Among the factors that are driving data centers to geographically diver-
sify locations across multiple regions are globalization, security, disaster recovery, and
minimizing downtime [133]. Distributed facilities at various locations can benefit from
the geographical diversity of the generation mix [134] as well as of electricity
prices [135]. Additional emerging scenarios, such as policy directives and transmission line
bottlenecks, add to the complexity [17].
This chapter focuses on the power procurement problem incorporating emerging policy
and transmission line capacity constraints. We use matroid theory to
model the problem as a combinatorial problem, and propose an optimal and distributed
solution for it. Furthermore, we account for communication and computational complexity
and demonstrate that, although the power procurement problem is NP-hard, simpler
versions of the problem admit polynomial-time solutions.
Note that the power procurement problem modeled in this chapter is general in
nature and is not specific to data centers. In fact, it captures the more general problem
of scheduling power generation to a set of load areas exhibiting policy-related
constraints.
5.1 Introduction
Power procurement from a variety of sources to meet load demands while minimizing
operating costs has historically been a cardinal problem in the industrial world. In its
classical settings, the power procurement involves the selection of generating sources to
supply forecasted load that minimizes cost over a required planning horizon. With the
rise of big data, energy consumption of data center facilities are sky-rocketing resulting
in millions of metric tons of CO2 generation. However, recent environmental awareness
is driving the data center industry to adopt greener sources.
Furthermore, as geographically diverse data centers marry the “smarter” cyber-physical
grid in a deregulated market, we witness a fluctuating technical, political, and financial
landscape in which decisions are constrained by additional issues, including policy and
transmission line bottlenecks. For instance, although it may take one year to build a wind
farm, it can take five years to build the necessary transmission lines needed to carry its
power to cities [136]. Moreover, regulations are pushing data centers to incorporate
green-practice-driven policies, e.g., the U.S. Energy Efficiency Improvement Act of 2014 (HR
540) bill [17], resulting in novel system constraints. Such policies can dictate constraints
on the choices of energy sources for various facilities in the shared distribution system.
In this vein, the objective of this chapter is to explore the interaction of power procurement
problem structure with complexity in the context of emerging constraints. For this
reason we focus on the fundamental aspects of the constrained power procurement problem
in order to more effectively identify the trade-offs and their relationship to complexity.
In particular, we consider a power procurement problem in which we take into account
a subset of constraints from its classical counterpart while exploring the integration of
new requirements. We develop a modified formulation that lends itself to a better un-
derstanding of issues of complexity and tractability. Within this formulation we explore
solution complexity to ask: Can we solve the general problem in polynomial time? If
not, what makes the structure of the problem NP-hard? For what special cases can we
find efficient solutions? We emphasize that we aim to provide insight into cyber-physical
constraints and complexity to aid future opportunities for evolution.
We specifically address the following physical system requirements: 1) policy con-
straints which deal with prioritization of generation source classes for specific facilities,
2) transmission line constraints involving line overloads that arise from usage by multiple
generating sources sharing the same transmission lines, and 3) delay constraints including
startup lags of the generation sources. We call this the constrained power procurement
(CPP) problem. We associate a time-varying cost of operation to each generating source
to reflect the temporal diversity of electricity prices.
An instance of the CPP problem consists of a number of generating sources, a set of
transmission lines, and a set of data center facilities. Each generating point can represent
a collection of sources. Each transmission line has a known capacity limit. Each data
center facility has a forecasted demand and a set of policy-driven generation preferences.
An example follows.
5.1.1 Example Case Study: Power Procurement over a 12-Bus Power System
Consider three data center facilities A1, A2, A3, that we revisit throughout this chapter,
connected to generating sources g1, g2, · · · , g8 over a 12-bus power system, and a set of
two bottleneck transmission lines γ1, γ2, shown in bold in Fig. 5.1. The combined output of any
two or more generating points exceeds the transmission line capacities of γ1
and γ2, and therefore simultaneous use of either line by more than one generating point is prohibited. A
careful study reveals the following facts.
If selected for procurement, the generating sources Gγ1 = {g1, g2, g3, g8} need to use one
common transmission line γ1 and thus cannot be simultaneously procured. Similarly,
Gγ2 = {g4, g5, g6, g7} cannot be simultaneously used for procurement over γ2. Thus
the transmission line constraints impose that no two generating sources from Gγ1 and
Gγ2 be scheduled to use γ1 and γ2, respectively, at the same time. Moreover, due to
policy matters, each data center facility has a preference of generating sources. In this
example, A1 chooses GA1 = {g1, g2, g3}, A2 opts for GA2 = {g4, g5}, and area A3 must
fulfill its demand with GA3 = {g6, g7, g8}. Thus, within the scope of the planning horizon,
the transmission line and policy constraints dictate selection of only one generating point
from each of GA1 , GA2 , and GA3 for data center facilities A1, A2 and A3 respectively.
Figure 5.1: Three data center facilities A1, A2, A3 and a 12-bus power system with eight generating sources g1, g2, . . . , g8. The constrained transmission lines, γ1 and γ2, are shown in bold.
Figure 5.2: An equivalent representation of the imposed constraints for the 12-bus case study shown in Figure 5.1.
Figure 5.3: Startup delays and generation cost associated with each generating point for each time unit t0, t1, t2, and a planning horizon of three time units, for the 12-bus case study shown in Figure 5.1.
In this example, we assume that any generating source within GAi can accommodate
the forecasted load for Ai; if that is not the case, then more than one generating point
from the set can be selected. A three-tier view of the constraints is shown in Fig. 5.2,
where generating sources are connected by lines to show the data center facilities they
can serve and the shared transmission line they need to couple to. Moreover, for each
generating source, the associated startup delays and generation costs for a planning
horizon of three time units t0, t1, t2 are shown in Fig. 5.3.
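The staggering implied by these constraints can be verified by brute force. The sketch below (an illustrative aid, not part of the thesis) encodes the policy sets and bottleneck-line assignments from Fig. 5.2 and enumerates every per-time-slot selection; it confirms that at most two of the three facilities can be powered in any single time unit, which is why the schedule must be spread over the planning horizon.

```python
from itertools import product

# Policy sets G_Ai and the bottleneck line each source must couple to (Fig. 5.2).
policy = {"A1": ["g1", "g2", "g3"], "A2": ["g4", "g5"], "A3": ["g6", "g7", "g8"]}
line = {g: "gamma1" for g in ("g1", "g2", "g3", "g8")}
line.update({g: "gamma2" for g in ("g4", "g5", "g6", "g7")})

best = 0
# For each facility, pick one preferred source or None (facility idle this slot).
for choice in product(*[opts + [None] for opts in policy.values()]):
    picked = [g for g in choice if g is not None]
    used = [line[g] for g in picked]
    if len(used) == len(set(used)):  # each bottleneck line carries one source
        best = max(best, len(picked))

print(best)  # at most 2 facilities can be powered in the same time unit
```

Since both of A2's sources and two of A3's three sources sit on γ2, while A3's remaining source g8 shares γ1 with all of A1's sources, every simultaneous assignment of all three facilities double-loads a bottleneck line.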
5.2 Related Work
The related work can be broadly categorized into three thrusts, namely power procure-
ment for data centers, distributed generation, and generator scheduling.
The joint evolution of big data industry and smart grid invites a new era of power
procurement for data centers. The current efforts in power procurement for data centers
can be subdivided into the following categories. The first category deals with making the best use
of local renewable power sources like wind, solar, battery banks, and backup generators
[137–139]. The second category focuses on geographical distribution of workloads and
geographical or temporal price diversity [134, 135, 140]. The third category focuses on
demand response via workload scheduling, incorporating the dynamic nature of electricity
pricing by adjusting power usage patterns. Some of the demand response strategies also
take into account the use of renewables and local energy sources [141–143]. The fourth
category focuses on improving the efficiency and management of hardware and software
resources [144,145].
We want to point out that the common objective of all four categories is reducing
electricity bills. However, we assert that aside from lowering the electricity cost, the
emerging scenarios require a power procurement practice that incorporates distributed
generation in the presence of policy, transmission line, and generation delay constraints.
Incorporation of these constraints has not been well explored by the research community. In
comparison, in this work we consider power procurement in the presence of several physi-
cal constraints imposed by the power grid, as well as policy constraints from a data center
operator’s perspective, while minimizing the electricity cost. In this context, to the best
of our knowledge, this is the first work that captures the aforementioned constraints while
lowering electricity costs.
The distributed generation thrust can be subdivided into the following directions.
From the perspective of active distribution networks, scheduling of distributed energy
resources with respect to the network topology has been considered for the optimal co-
ordination of energy resources [146]. Reference [147] presents production cost minimiza-
tion and network constraint management for active management of distributed energy
resources. Distributed generation (DG) has been explored in terms of compact integrated
energy systems that use 25% fewer semiconductors in [148]. Control techniques for integration
of DG resources into the electrical power network have been studied in [149].
Furthermore, [150] studies the dynamic allocation of the resources in the dispatch of DG
using an evolutionary game theoretical approach for optimal and feasible solutions in a
microgrid structure. Moreover, the placement of generators in DG for loss reduction in
primary distribution networks has been studied in [151]. References [152, 153] present a distribution
optimal power flow model to integrate local distribution system feeders into a smart
grid. Reference [154] presents a jump and shift method to handle the multi-objective
optimization in the smart energy management systems, where the goal is to minimize
the operating costs by reducing the emission levels while meeting the load demands.
In contrast, the primary objective of our work is to explore how the emerging policy
and transmission line constraints can significantly affect the complexity of the power
procurement problem in the presence of distributed generation, which has not been ex-
plored previously in a holistic fashion. In this vein, we present a novel approach based
on matroid theory to provide a better insight into the problem.
A counterpart of this problem is the generator scheduling problem, where a set of
generators are committed to meet the load demands. Intelligent generation scheduling
(involving unit commitment [155] and economic dispatch [156]) has been widely studied,
but classically generator scheduling has not accounted for policy and transmission line
constraints together, often leaving their inclusion to the next stage of economic dispatch.
However, the growing trend in power systems is one in which these limitations must be
accounted for early on during the scheduling phase to provide a more comprehensive
view [157]. Therefore, research has been done to solve the generator scheduling problem
for real-life large-scale power systems (e.g., see [158]), and also for generator scheduling
in the presence of security constraints [159] and transmission line capacity constraints
[160]. However, the growing emphasis on environmental responsibility calls for inclusion
of policy-based constraints. Generation scheduling taking into account operational policy,
in terms of the number of control actions, has been studied in [161, 162]. Similarly, [163, 164]
explore maximization of social welfare and management of environmental impact issues.
A generalized formulation for artificial-intelligence-based intelligent energy management,
minimizing the operating cost and environmental impact of a microgrid, has been explored
in [165].
In contrast, the emerging green scenarios call for policy initiatives that have not been
addressed before, e.g., the option of explicitly selecting among the energy sources, which
is addressed in our work.
5.3 Contributions
Power procurement to match the demands of geographically distributed data center facilities
while simultaneously meeting constraints involves extensive combinatorics due to
the discrete nature of the problem: a generating source is either selected or not.
An alternative approach could relax the integer constraints imposed by the problem and
solve it via linear programming; however, such a solution may lack optimality and in some
cases rounding might result in an infeasible solution. Thus, in this work we propose an
approach for combinatorial optimization, suitable for distributed computing, that aims
to leverage the natural symmetry and structure of the CPP problem constraints while
providing insight on issues of computational complexity.
Our contributions are three-fold:
1. We develop a framework for CPP in which the problem is broken down into building
blocks effective in relating procurement complexity to physical system constraints;
specifically, we introduce the 1-restricted constrained power procurement (1-RCPP)
and the T -restricted constrained power procurement (T-RCPP) problems.
2. We present distributed polynomial time solutions to the 1-RCPP and T-RCPP
problems. The 1-RCPP problem assumes zero delay and a planning horizon of one
time unit thus dealing only with transmission line and policy constraints, which can
be represented as matroids. The T-RCPP problem extends the 1-RCPP problem
to a planning horizon of T time units and includes delay constraints. Moreover, our
distributed solution proposes a novel way for assignment of IDs to the generating
sources to facilitate the distributed solution.
3. We show how, for less restrictive physical constraints, the CPP problem is NP-hard
to approximate within a ratio of n^{1−ε} for any constant ε > 0.
We want to point out that one of the primary objectives of our work is to explore
how the emerging constraints can significantly affect the complexity of the problem.
Therefore, we deviate from the classical formulation to consider a new approach that allows
us to use matroid theory to gain better insight into the problem. We believe that only
by studying a problem that deviates from the classical formulation can we shed
light on the interaction between constraints and complexity.
5.4 Problem Formulation
We consider a set of k data center facilities A1, · · · , Ak, with forecasted load demand
DAi for facility Ai, that can procure power from a set of n sources G = {g1, · · · , gn},
and a planning horizon T . Each source is associated with a variable uijt ∈ {0, 1}, where
uijt = 1 means that generating unit gi is scheduled for procurement for facility Aj at time
t. A generator gi generates power Wit at time t such that Wi^min ≤ Wit ≤ Wi^max.
There are q buses B = {b1, · · · , bq}; let ∧bi be the set of indices of the generating
sources at bus bi, and Dbi,t be the forecasted demand at bus bi in time period t. We assume
ℓ transmission lines Γ = {γ1, · · · , γℓ} and a transmission capacity Fγi for each transmission
line γi. Let τγi,bj be the line flow distribution factor at transmission line γi for the
bus bj. For the case study of Section 5.1.1 the corresponding set of generating sources
is G = {g1, g2, . . . , g8}, the set of (bottleneck) transmission lines is Γ = {γ1, γ2}, and the
data center facilities are A1, A2, and A3.
Each generating source is associated with a time-dependent generation cost c : G ×
{0, 1, · · · , T − 1} −→ R≥0 that assigns a non-negative cost to each generating point gi
over all time units tj ∈ {0, 1, · · · , T − 1}, such that c(gi, tj) is the generation
cost of gi at time tj. In addition, each generating source gi is associated with a generation delay
(advance notification) d(gi) ∈ {0, 1, · · · } to reflect wait times in startup. If a data center
facility has finalized gi at tj in its procurement plan, the actual time when gi couples to
the transmission line is d(gi) + tj.
Furthermore, we define a vector z of length k, where zj, associated with facility Aj,
is defined as:

zj = 1 if Σ_{gi∈sj} Wit uij(t−d(gi)) = DAj, and zj = 0 otherwise.   (5.1)
An allocation at time ti, denoted by ati ⊆ G, is a set of generating sources powering
the facilities at time ti. A power procurement schedule is defined to be the set of alloca-
tions at times 0, 1, · · · , T − 1 for a planning horizon of T . A feasible power procurement
schedule is defined to be a schedule that satisfies the following three constraints:
1. Policy constraints: the demand of each data center facility must be met while
respecting the policy initiatives of that facility. More specifically, each
facility Ai is associated with a set si ⊆ G of generating points that it opts for to meet
its demand.
2. Transmission line constraints, captured by the dc power flow model:

Σ_{bj} τγi,bj ( Σ_{k∈∧bj} ukt Wkt − Dbj,t ) ≤ |Fγi|,  ∀ γi ∈ Γ, t ≤ T
3. Delay constraints whereby in a feasible allocation and for a planning horizon T ,
a generating point gi can only be allocated its required transmission line at time tj
if d(gi) + tj ≤ T − 1. The delay constraint captures the minimum downtime or advance
notification of a generating point.
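As a small illustration (with hypothetical values, not taken from the thesis), the delay constraint reduces to a one-line check:

```python
def delay_feasible(d_gi: int, t_j: int, T: int) -> bool:
    """A source with startup delay d(gi), allocated at time tj, is feasible
    only if it couples to its line within the horizon: d(gi) + tj <= T - 1."""
    return d_gi + t_j <= T - 1

print(delay_feasible(2, 0, 3))  # True: couples at time 2, the last slot
print(delay_feasible(2, 1, 3))  # False: would couple at time 3, past the horizon
```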
The cost of an allocation ati is defined to be the sum of the generation costs of all the
generating sources in ati: cost(ati) = Σ_{gj∈ati} c(gj, ti). Subsequently, the cost of a
power procurement schedule Q is given by the sum of the costs over all its allocations:
cost(Q) = Σ_{ati∈Q} cost(ati).
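The two cost definitions translate directly into code. The snippet below is an illustrative sketch with hypothetical cost values; `c` maps (generator, time) pairs to generation costs.

```python
def allocation_cost(alloc, t, c):
    # cost(a_t): sum of the generation costs of the sources in the allocation
    return sum(c[(g, t)] for g in alloc)

def schedule_cost(schedule, c):
    # cost(Q): sum of allocation costs over the planning horizon
    return sum(allocation_cost(alloc, t, c) for t, alloc in enumerate(schedule))

# Hypothetical two-slot schedule over generators g1 and g4.
c = {("g1", 0): 7, ("g4", 0): 9, ("g1", 1): 6, ("g4", 1): 8}
Q = [{"g1", "g4"}, {"g1"}]
print(schedule_cost(Q, c))  # 7 + 9 + 6 = 22
```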
Definition 30 (Constrained Power Procurement (CPP) Problem): For an instance of
the constrained power procurement problem with policy, transmission line, and delay con-
straints as defined above, find a least cost feasible power procurement schedule; that is,
no other feasible schedule has a lower cost.
The objective of the CPP problem is to power up the maximum possible number of
facilities while respecting their policy initiatives without violating the policy, transmission
line and delay constraints such that the solution has the least cost amongst all constrained
possibilities. The CPP problem can be summarized as below:
min Σ_{t=0}^{T−1} Σ_{j=1}^{k} Σ_{i=1}^{n} uijt c(gi, t),   max (||z||1)   (5.2)

subject to

Σ_{gi∈sj} Wit uij(t−d(gi)) ≤ DAj,  ∀ Aj, t   (5.3)

Σ_{j=1}^{k} uijt ≤ 1,  ∀ gi, t   (5.4)

Σ_{bj} τγi,bj ( Σ_{ℓ∈∧bj} Σ_{m=1}^{k} uℓm(t−d(gℓ)) Wℓt − Dbj,t ) ≤ |Fγi|,  ∀ γi, t   (5.5)

t + d(gi) ≤ T − 1,  ∀ uijt = 1   (5.6)

zj = 1 if Σ_{gi∈sj} Wit uij(t−d(gi)) = DAj, and zj = 0 otherwise,  ∀ Aj, t   (5.7)
Note that Equations 5.3, 5.5, and 5.6 deal with the policy, transmission line, and
delay constraints respectively.
5.4.1 Why Matroids
In this section we motivate the selection of matroid theory as a tool for the power
procurement problem. We start by presenting a simple instance of the CPP problem
with six generators G = {g1, g2, g3, g4, g5, g6}, where c(g1, 0) = 7, c(g2, 0) = 6, c(g3, 0) =
6, c(g4, 0) = 9, c(g5, 0) = 7, c(g6, 0) = 9, T = 1, and d(gi) = 0 for all i. There are three
areas A1, A2, A3, and three transmission lines γ1, γ2, and γ3. The transmission line γi
is connected to the bus bi, where τγi,bi = 1 and Dbi,t = 0. Furthermore, Wi0 = 1,
DAi = 1, and Fγi = 1 for all i. The policy sets are: s1 = {g3}, s2 = {g2, g4, g6}, and
s3 = {g1, g5}. The generators connected to buses b1, b2, and b3 are {g1, g2, g3}, {g5}, and
{g4, g6}, respectively. This instance belongs to the Mixed Integer Programming (MIP)
family, as given in Figure 5.4.
Equations 5.12 - 5.26 capture Equation 5.1.
We start by describing the complexity of the problems belonging to the MIP family,
and then focus on some heuristic approaches (e.g., relaxation) that might work for
certain scenarios in MIP.
We first note that Integer Programming (IP)/Mixed Integer Programming (MIP) is
not only NP-hard but also has no known polynomial-time approximation algorithm, since
the Boolean Satisfiability problem can be solved using IP [166–168]. It is so computationally
complex that, even for the case where the number of constraints is superlinear
in the number of variables n, the existence of any algorithm with complexity O(2^{(1−s)n}) for
s > 0 contradicts the well-known Strong Exponential Time Hypothesis (SETH), which states
that k-SAT does not have a subexponential-time algorithm [169, 170]. In fact, IP/MIP
are so general that virtually any combinatorial optimization problem can be transformed
into IP/MIP [171]. Furthermore, an NP-hard problem with no known polynomial-time
approximation algorithm, e.g., IP/MIP, is believed to have exponential computational
complexity [166], and is not tractable [172].
The commonly used approaches to solve IP/MIP include branch-and-bound and relaxation.
Solution techniques based on branch-and-bound methods have exponential
complexity [173]. Furthermore, an approach based on relaxation may not produce a
feasible solution by means of trivial rounding. For instance, relaxing the constraint on
the variables uij0 so that they may take any real value between 0 and 1, and then rounding
the solution obtained, seems computationally tractable, but such an approach can
result in an infeasible solution. For example, the relaxation of the binary variables uij0
(the last constraint) in the program given in Figure 5.4 yields the following solution:
u130 = 0, u220 = 0, u310 = 1, u420 = 0.5, u530 = 1, u620 = 0.5. It is easy to see that
both simple rounding up and rounding down result in an
min (7u130 + 6u220 + 6u310 + 9u420 + 7u530 + 9u620), max (||z||1) (5.8)
subject to u310 ≤ 1 (5.9)
u220 + u420 + u620 ≤ 1 (5.10)
u130 + u530 ≤ 1 (5.11)
α1 = u310 − 1 (5.12)
−α1 ≤ M1(1 − z1) (5.13)
α1 ≤ −m1(1 − z1) (5.14)
m1(|α1| + 1) < 0.1 (5.15)
|α1| − M1 < 0 (5.16)
α2 = (u220 + u420 + u620) − 1 (5.17)
−α2 ≤ M2(1 − z2) (5.18)
α2 ≤ −m2(1 − z2) (5.19)
m2(|α2| + 1) < 0.1 (5.20)
|α2| − M2 < 0 (5.21)
α3 = (u130 + u530) − 1 (5.22)
−α3 ≤ M3(1 − z3) (5.23)
α3 ≤ −m3(1 − z3) (5.24)
m3(|α3| + 1) < 0.1 (5.25)
|α3| − M3 < 0 (5.26)
u130 + u220 + u310 ≤ 1 (5.27)
u530 ≤ 1 (5.28)
u420 + u620 ≤ 1 (5.29)
mj, Mj ∈ R+ (5.30)
z1, z2, z3 ∈ {0, 1} ∀ Aj (5.31)
uij0 ∈ {0, 1} ∀ i, j (5.32)
Figure 5.4: An instance of the CPP problem.
infeasible solution: specifically, rounding up violates the policy constraint, and rounding
down fails to serve area A2. Note that sophisticated rounding techniques specific
to certain problem formulations might be developed that can possibly ensure feasibility.
Given that relaxation techniques might be developed specific to a problem
formulation, it is an interesting open question to investigate whether such a technique can be
developed for the CPP problem that allows an effective binary-recovery procedure, given
the specific form of the constraints in the problem formulation.
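The infeasibility of trivial rounding can be verified mechanically. The sketch below (an illustrative aid, not part of the thesis) takes the fractional solution quoted above and tests both roundings against A2's policy constraint (Eq. 5.10) and A2's demand:

```python
import math

# Fractional LP solution obtained by relaxing the program in Figure 5.4.
frac = {"u130": 0.0, "u220": 0.0, "u310": 1.0,
        "u420": 0.5, "u530": 1.0, "u620": 0.5}

def policy_ok(u):
    # Eqs. 5.9-5.11: each facility procures from at most one chosen source.
    return (u["u310"] <= 1 and
            u["u220"] + u["u420"] + u["u620"] <= 1 and
            u["u130"] + u["u530"] <= 1)

def a2_served(u):
    # A2's unit demand is met only if one of its policy-set sources is selected.
    return u["u220"] + u["u420"] + u["u620"] >= 1

up = {k: math.ceil(v) for k, v in frac.items()}
down = {k: math.floor(v) for k, v in frac.items()}
print(policy_ok(up))    # False: rounding up selects both g4 and g6 for A2
print(a2_served(down))  # False: rounding down leaves A2 unserved
```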
Furthermore, the solutions for IP/MIP are usually centralized, whereas the mammoth
size of future power systems and the increasing number of data center facilities call
for a distributed solution [174, 175]. Even the distributed solutions for IP/MIP are in fact
parallel algorithms assuming loosely synchronous models [176, 177], which are suitable only
for co-located high-performance computing (HPC) clusters, for example IBM's CPLEX
on an HPC cluster. These solutions cannot be used for distributed entities
that are geographically far apart and lack the reliable communication mechanism
required by HPC clusters. Note that Information and Communication Technology (ICT)
support in the evolving smart grid is mostly based on asynchronous TCP and unreliable
IP infrastructure, so there is a need for distributed solutions that solve the power procurement
problem asynchronously while taking into account the complexity class of the problem
instance. Furthermore, any efficient distributed solution utilizing a TCP/IP infrastructure
needs to take into account the inherent rate-control limitations of TCP, which arise
when a huge number of messages is exchanged (packets sent) over the network.
The power procurement problem is a combinatorial optimization problem; choosing
the least-cost procurement by a generate-and-test approach requires processing
O(2^{|G|}) possibilities. A major challenge in solving combinatorial optimization problems
is to develop solutions with an acceptably small number of computational steps. An accepted
standard in the realm of combinatorial optimization is smartly crafted problem-specific
solutions that require a number of computational steps bounded by a
polynomial in the size of the problem. Such polynomial-time solutions and techniques
exploit the specific structure of the problem. Each such solution usually has its own
niche where it performs efficiently. On the other hand, solutions and techniques that
try to solve combinatorial optimization problems in general, like IP/MIP, are not very
efficient, as they do not exploit problem-specific characteristics. Moreover, there is no
possibility of establishing bounds on the length of the computations involved in solving
IP/MIP [171]. It is therefore important to classify the problem into different complexity
classes, and then, based on the classification, use the most efficient solution in terms of
computational complexity.
Given the combinatorial nature of the problem, we have used matroid theory to cap-
ture the variety of constraints faced by evolving power procurement scenarios. The struc-
ture of matroids enables one to effectively rule out large groups of generating choices for
procurement with a simple test in contrast to sifting through all possible combinations.
We use matroid theory to develop an efficient solution with polynomial time computa-
tional complexity. Our formulation enables us to classify the problem at hand based on
computational complexity, identifying easy and hard instances of the problem. Moreover,
the theoretical analysis of the computational and message complexity of the proposed
matroid-intersection-based scheme is of particular interest in the context of designing
industrial-scale systems. This analysis can help in deciding the performance
requirements of system components such as the CPU/controller, and the communication
requirements on the interconnecting TCP/IP network.
5.5 Restricted Constrained Power Procurement Problem
In this section we deal with a restricted version of the CPP problem, called the restricted
constrained power procurement (RCPP) problem, where Wi = DAj ∀ gi ∈ sj, and
Wgi < Fγj < 2Wgi ∀ gi ∈ ∧bk with τγj,bk ≠ 0 (i.e., bus bk injects flow into the transmission line
γj).
Later, in Section 5.6 we show that without this restriction the problem becomes NP-
hard. We start by presenting the solution to the 1-RCPP problem which exhibits the
following additional restrictions:
• T = 1, which naturally implies
• d(gi) = 0 ∀i and
• c(gi, tj) = c(gi) ∀i, j.
It is obvious that the least-cost schedule for the 1-RCPP problem is the same as the least-cost
allocation at time ti = 0, since the planning horizon is just one time unit. Thus, we
may use the solution to the 1-RCPP problem as a building block for the solution of the
T-RCPP problem with planning horizon T .
5.5.1 Relation to Matroids
To apply matroid theory, we first capture the constraints of Section 5.4 as matroids
defined in Chapter 2.
5.5.1.1 Policy Constraints
We first define a set PC(G) = {s1, · · · , sk} consisting of distinct (non-overlapping)
policy sets si ⊆ G defined for each facility Ai. For the facilities A1, A2, and A3 in
the case study of Section 5.1.1 the corresponding policy constraint set is: PC(G) =
{{g1, g2, g3}, {g4, g5}, {g6, g7, g8}}. To capture the policy constraints we start by working
in a k dimensional vector space, with k basis vectors given by b1, b2, · · · , bk. We further
define a matrix A with k rows and n columns. Each generating point gi ∈ G is associated
with a unique column vector zi in matrix A. Each zi is a vector in the k dimensional
vector space. To each policy set si ∈ PC(G), we assign a unique base vector bi. More-
over, all the generating points in si are assigned unique column vectors that are positive
(sequential) integer multiples of bi such that they are unique but linearly dependent over
bi. Specifically, if gj, gk, · · · , gy+x ∈ si, then we assign vector zj = bi to gj, zk = 2 ·bi to gk,
and similarly to the remaining generating points using consecutive integer multiples of bi.
Note that this unique vector shall be later used as unique ID for each of the generating
sources in the distributed algorithm presented in Section 5.5.2.
We define L1 = {L1 | L1 ⊆ G and all the column vectors zi associated with the
gi ∈ L1 are linearly independent}; that is, L1 is the family of all subsets of
G whose associated column vectors are linearly independent.
Lemma 31 SM ≜ (G, L1) is a vector matroid over ground set G, with the collection of
independent sets given by L1.
Proof: Follows directly from the definition of vector matroid [38].
For our case study of Section 5.1.1 the matrix A is given below:

A =
[ 1 2 3 0 0 0 0 0 ]
[ 0 0 0 1 2 0 0 0 ]
[ 0 0 0 0 0 1 2 3 ].
In this example the ground set for SM is G = {g1, . . . , g8}. Furthermore, L1 is the set of
all subsets of G whose associated column vectors are linearly independent; an example
of one such set is {g1, g4, g7}.
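Because every column of A is, by construction, a positive integer multiple of exactly one basis vector bi, testing membership in L1 reduces to checking that no two chosen sources share a policy set, i.e., no two columns lie on the same basis direction. A minimal sketch (an illustrative aid, not part of the thesis):

```python
# Basis direction (policy set) that each generator's column of A lies on.
policy_row = {"g1": 0, "g2": 0, "g3": 0,   # multiples of b1
              "g4": 1, "g5": 1,            # multiples of b2
              "g6": 2, "g7": 2, "g8": 2}   # multiples of b3

def in_L1(gens):
    """Columns are linearly independent iff no two share a basis direction."""
    rows = [policy_row[g] for g in gens]
    return len(rows) == len(set(rows))

print(in_L1({"g1", "g4", "g7"}))  # True: one source per policy set
print(in_L1({"g1", "g2", "g7"}))  # False: g1 and g2 are both multiples of b1
```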
5.5.1.2 Transmission Line Constraints
We define a family of subsets of G, TLC(G) = {x1, · · · , xℓ}, xi ⊆ G ∀i, where set xi
contains all generating sources that require line γi but have conflicts such that their
total output exceeds the line capacity Fγi. Let L2 be the family of all subsets of G
that are feasible with respect to the transmission line constraints, that is, that have no
transmission line conflicts. Formally, L2 = {L2 | L2 ⊆ L′2}, where L′2 ⊆ 2^G such that
L′2 = {e1, · · · , ek | e1 ∈ x1, · · · , ek ∈ xk, k ≤ ℓ}.
Lemma 32 TLM ≜ (G, L2) is a partition matroid over ground set G, with the collection
of independent sets given by L2.
Proof: The proof is based on two observations. The first is that each L2 is a
partial transversal (as defined in Section 2.2) of the transmission line constraint set
TLC(G) = {x1, · · · , xℓ}, so L2 is the set of all partial transversals of the transmission
line constraint set TLC(G). The second observation is that the transmission
line constraint set TLC(G) defines a partition of G. Hence, by the definition of a partition
matroid [38], TLM is a partition matroid over the ground set G.
For the case study of Section 5.1.1, the ground set for TLM is G, and L2 is the set of all
partial transversals of TLC(G); one such independent set is {g1, g4}.
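Membership in L2 is equally direct: an independent set picks at most one source from each block of the partition TLC(G). A sketch for the case study (an illustrative aid, not part of the thesis):

```python
# Partition of G induced by the two bottleneck transmission lines.
TLC = {"gamma1": {"g1", "g2", "g3", "g8"},
       "gamma2": {"g4", "g5", "g6", "g7"}}

def in_L2(gens):
    """Partial transversal test: at most one chosen source per line block."""
    return all(len(set(gens) & block) <= 1 for block in TLC.values())

print(in_L2({"g1", "g4"}))  # True: one source per bottleneck line
print(in_L2({"g1", "g3"}))  # False: g1 and g3 both require line gamma1
```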
5.5.2 A Distributed Algorithm for the 1-RCPP Problem
Since we are in the era of distributed generation, a distributed power procurement
scheme for distributed facilities is a natural fit. This section provides a distributed solution to the
power procurement problem where generating sources collaboratively and distributedly
decide the least cost schedule for powering up the geographically distributed facilities.
It has been shown in the previous section that policy constraints can be captured by
a vector matroid SM = (G,L1), and transmission line constraints can be captured by a
partition matroid TLM = (G,L2). Moreover, both the constraints need to be satisfied
for finding a feasible solution for the 1-RCPP problem.
Specifically, the 1-RCPP problem can be solved by finding the maximum cardinality
intersection of L1 and L2 (to maximize facilities powered up under constraints) with least
cost.
Lemma 33 The optimal solution to 1-RCPP is given by the least cost maximal
cardinality common independent set of matroids SM and TLM.
The proof of Lemma 33 is given in Appendix C.1.
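For small instances, Lemma 33 admits a brute-force reference check: enumerate subsets of the ground set, keep those independent in both matroids, and return a least-cost one of maximum cardinality. The sketch below is exponential time and purely illustrative; the toy oracles and costs are hypothetical (here both oracles happen to be partition matroids), not the case-study data, and the distributed Algorithm 1-RCPP below replaces this enumeration with matroid intersection.

```python
from itertools import combinations

def best_common_independent_set(ground, indep1, indep2, cost):
    """Brute-force reference for Lemma 33 (small instances only): among all
    common independent sets of the two matroids, return a maximum-cardinality
    one of least total cost."""
    for r in range(len(ground), 0, -1):
        candidates = [s for s in combinations(ground, r)
                      if indep1(s) and indep2(s)]
        if candidates:
            best = min(candidates, key=lambda s: sum(cost[g] for g in s))
            return set(best), sum(cost[g] for g in best)
    return set(), 0

# Toy instance (hypothetical data): matroid 1 forbids {a, b} together,
# matroid 2 forbids {b, c} together; both are partition matroids.
indep1 = lambda s: not {"a", "b"} <= set(s)
indep2 = lambda s: not {"b", "c"} <= set(s)
cost = {"a": 3, "b": 1, "c": 2}
sel, c = best_common_independent_set(["a", "b", "c"], indep1, indep2, cost)
print(sel, c)  # {'a', 'c'} 5
```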
We first give a brief overview of the proposed distributed algorithm. Each generating
source executes a copy of Algorithm 1-RCPP based on Edmonds' weighted matroid
intersection algorithm [38, 171, 178].
For the sake of simplicity, we assume that each generating source can communicate
with any other generating point (via a possible multi-hop or mesh network). Moreover,
each generating source gi has a unique vector ID idi that consists of two parts. The first
part id1i corresponds to the vector zi assigned to it (used for capturing policy constraints
by the vector matroid representation of Section 5.5.1). The second part id2i indicates
the transmission line it must couple with; for instance, the second part of idi for the
generating point gi is j if gi ∈ xj. For our case study the ID for g1 is id1 = 1001,
where id11 = z1 = 100 and id21 = 1 since g1 ∈ x1. We assume each
generating point knows the IDs of all the other generating points. The communication is
done in rounds. Each round is subdivided into three phases denoted Phase 1, Phase 2,
and Phase 3. In each round, the set of generators selected so far is denoted by Gon.
Using algorithm ChkL, each generator independently computes the sets G1, G2, →E and
←E. Sets G1 and G2 are the sets of generators that can be added to the generators
selected so far without violating the policy and transmission line constraints, respectively.
The sets ←E and →E represent sets of valid replacements for the generators selected so far
with respect to the policy and transmission line constraints, respectively. The algorithm is
initiated by an external call init(). Each generator maintains a vector Gon of length n
indicating the selection of a generator. The updated vector is communicated at the end
of the round. Gon is updated in each round until the optimal solution is reached. Initially
no generator is selected, and each generator has a non-negative cost. During phase 1
of each round, the generator gi will compute two vectors ←E and →E, each of length n,
indicating valid replacements w.r.t. L1 and L2. A valid replacement means that if gj is
selected in the set Gon instead of gi, then the resulting selection still represents
a valid independent set with respect to the corresponding matroid. Computation of the
sets ←E and →E is done locally, with the help of the identification vectors associated
with the generators, using the subroutine ChkL(g, G′, num) given in Figure 5.6, where
num ∈ {1, 2}. Given a set of selected generators G′ and a generator g, the subroutine
ChkL(g, G′, num) checks if {g} ∪ G′ ∈ Lnum. For instance, ChkL(g, G′, 2) checks if
{g} ∪ G′ ∈ L2.
If a generator gi is not selected, then it proceeds to determine two binary vectors G1 and
G2, each of length n, by invoking ChkL(gi, Gon, 1) and ChkL(gi, Gon, 2) respectively. The
vectors G1 and G2 represent the generators that can be selected along with the already
selected generators and still belong to the independent sets L1 and L2, respectively. If gi is
selected in G1, then it proceeds to phase 2 of the round, where it finds the least cost path,
in a distributed manner, among all (gk, gj) pairs of generators such that gk is selected
in G1 and gj is selected in G2. Note that vectors →E and ←E are used as routing tables,
informing gi about the outgoing neighbors (incoming next hops) and incoming neighbors
(outgoing next hops) respectively. Also note that these next hops are not necessarily
the physical next hop neighbors in the actual network; rather, this is just a virtual tag
used by gi. A least cost path is the path in which the sum of the costs of all the generators
traversed by it is the least. Initially the generator gi finds the least cost path from itself
to gm, denoted Ggi⇝gm, with cost cim. Note that cim can also be zero, which happens when
gi is selected in both G1 and G2 (i.e., gi finds a path to itself). Moreover, among all (gi, gj)
pairs of generators such that gj is selected in G2 and is connected through a least cost
path of cost cim, gm is also the least hop count him away from gi. Generator gi then exchanges
messages with all other generators selected in G1 about its discovery of the least cost cim
and hop count him. After receiving information about the least cost and least hop count
from all other generators selected in G1, it finds the minimum cost and least hop count
path among all the (gk, gj)
Algorithm 1−RCPP(G)  Code for gi, 1 ≤ i ≤ n
 1  When init():
 2      Gon(g) = 0 ∀ g
 3      c(gi) = c(gi)
 4  For each round
    Phase 1:
 5      ←E(g) = 0, →E(g) = 0, G1(g) = 0, G2(g) = 0 ∀ g
        Compute routing table:
 6      If Gon(gi) = 1
 7          For each gj such that Gon(gj) = 0
 8              G′ = Gon
 9              G′(gi) = 0
10              →E(gj) = ChkL(gj, G′, 1)
11              ←E(gj) = ChkL(gj, G′, 2)
12      Else
13          For each gj such that Gon(gj) = 1
14              G′ = Gon
15              G′(gj) = 0
16              ←E(gj) = ChkL(gi, G′, 1)
17              →E(gj) = ChkL(gi, G′, 2)
18      For each gj such that Gon[gj] = 0
19          G1(gj) = ChkL(gj, Gon, 1)
20          G2(gj) = ChkL(gj, Gon, 2)
    Phase 2:
21      If G1(gi) = 1
22          For all G2(gj) = 1
23              Find a least cost path from gi to gj using routing tables ←E and →E
24      Let gm be the generating source connected through the least cost path Ggi⇝gm
        with cost cim and hop count him
25      Send(<cim, him>) to every generating point gj with G1(gj) = 1
26      Upon Receive(<cjm, hjm>) from every processor gj with G1(gj) = 1
27      c∗m = min over k with G1(gk) = 1 of ckm. If ckm = c∗m for more than one
        generating point, choose the generating point with least hop count h∗m. Let
        G∗gi′⇝gm′ be the least cost path among all paths Ggk⇝gj such that
        G1(gk) = 1 and G2(gj) = 1
    Phase 3:
28      If c∗m = ∞
29          Then Gon is a maximum-cardinality common independent set.
            Send(<stop>) to all generating sources
30      Else-If gi′ = gi
31          Let the path G∗gi′⇝gm′ = y0, z1, y1, · · · , zl, yl
32          G′on(gk) = 1 such that gk ∈ {g | Gon(g) = 1} \ {z1, · · · , zl} ∪ {y0, · · · , yl}
33          Send(<G′on>) to all generating sources
34  When Receive(<G′on>):
35      Update Gon
36      If Gon[gi] = 1
37          c(gi) = −c(gi)
38      Else
39          c(gi) = c(gi)
40      goto step 4
41  When Receive(<stop>):
42      If Gon[gi] = 1, then start working.

Figure 5.5: Algorithm 1−RCPP
pairs of generators such that gk is selected in G1 and gj is selected in G2. If more than
one generator has the same value of the least cost path cm, then identifications are used
to break the ties. Let this least cost and least hop count path be G∗gi′⇝gm′ = y0, z1, y1, · · · , zl, yl
with cost c∗m and hop count h∗m, where gi′ is the generator which is the source of the
least cost path. Here y0 = gi′ such that gi′ is selected in G1, and yl = gm′ such that gm′ is
selected in G2. The generator gi proceeds to phase 3 if gi = gi′ (i.e., the shortest path
starts from gi), and informs all the generators about the new selection of generators G′on;
every generator that receives this message updates its Gon vector accordingly. If
generator gj is among the updated list of selected generators in Gon, then it sets its cost
to be −c(gj); otherwise it sets its cost to be c(gj). The algorithm then proceeds to the
next round and repeats the same steps until there exists no shortest path, i.e., c∗m is
infinite; then gi′ sends a stop message to everyone, indicating that the final selection has
been made and all the selected generators can start working.
The selection of generating points in the optimal solution needs at most ℓ rounds; this
is because the exact number of rounds needed is given by the maximum possible number
of areas served, which is bounded from above by ℓ.
The formal Algorithm 1-RCPP for generating point gi is shown in Fig. 5.5.
Algorithm b = ChkL(gk, G′, num)
1  b = 1
2  If num = 1
3      For gj with G′[gj] = 1
4          if id1k and id1j are linearly dependent
5              Return b = 0
6  If num = 2
7      For gj with G′[gj] = 1
8          if id2k = id2j
9              Return b = 0

Figure 5.6: Algorithm b = ChkL
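The subroutine can be rendered directly in code. The sketch below follows the pairwise checks of the figure: linear dependence of two ID vectors (num = 1) and equality of line indices (num = 2). The concrete IDs at the bottom are hypothetical examples consistent with the case-study narrative (z1 = 100, g1 ∈ x1), not data quoted from it.

```python
def is_dependent(v, w):
    """Two vectors are linearly dependent iff one is a scalar multiple of
    the other, i.e. all 2x2 cross terms v[i]*w[j] - v[j]*w[i] vanish."""
    return all(v[i] * w[j] == v[j] * w[i]
               for i in range(len(v)) for j in range(i + 1, len(v)))

def chk_l(ids, k, G_prime, num):
    """Return 1 if adding generator g_k to the current selection G_prime
    keeps it independent w.r.t. L_num (num=1: policy vector matroid,
    num=2: transmission-line partition matroid).  ids[i] = (id1_i, id2_i)."""
    for j, selected in enumerate(G_prime):
        if not selected:
            continue
        if num == 1 and is_dependent(ids[k][0], ids[j][0]):
            return 0               # policy vectors clash
        if num == 2 and ids[k][1] == ids[j][1]:
            return 0               # same transmission-line block
    return 1

# Hypothetical IDs: g1 = (z1=(1,0,0), line 1), g2 = (z2=(2,0,0), line 1),
# g4 = (z4=(0,1,0), line 2).  The selection vector marks g1 as chosen.
ids = [((1, 0, 0), 1), ((2, 0, 0), 1), ((0, 1, 0), 2)]
print(chk_l(ids, 1, [1, 0, 0], 1))  # 0: z2 is a multiple of z1
print(chk_l(ids, 2, [1, 0, 0], 2))  # 1: g4 uses a different line
```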
Theorem 34 The distributed Algorithm 1-RCPP gives an optimal solution to the
1-RCPP problem.
The proof of Theorem 34 is given in Appendix C.2.
5.5.3 Complexity Analysis
The complexity of Algorithm 1-RCPP is measured in terms of the number of messages
exchanged M and the local computation time at each generating point Tlocal.
The number of messages sent across the network in Phase 2 is incurred in two steps.
First, in computing the shortest path from all generating sources in G1 to all generating
sources in G2, which in the worst case amounts to all-pairs shortest paths; the message
complexity of the all-pairs shortest paths algorithm given by S. Halder [179] is 2n²
messages. Second, in the worst case, if |G1| = |G2| = n, the number of messages sent
is n². In Phase 3 the source node of the selected shortest path sends n messages, one to
each generating source. Therefore the total message complexity for the three phases of a
round is O(n²). There can be at most ℓ rounds. Therefore the total number of messages
exchanged is M = O(ℓ n²).
The time complexity is incurred at the following steps of each round. The time to fill
the routing tables using subroutine ChkL is O(n²), as each generating point checks its
neighborhood against at most all other generating sources; precisely, checking one entry
in vectors →E and ←E requires O(n) steps. Similarly, computing the entries of vectors
G1 and G2 also requires O(n²) computations each. Steps 24 and 27 require the
identification of generating sources with minimum distance and hops, which necessitates
at most O(n²) computations. Therefore in each round the total local time complexity at
each generating source is at most O(n²), and the total complexity over all rounds is
Tlocal = O(ℓ n²).
5.5.4 T-Restricted Constrained Power Procurement Problem
In Section 5.5 we presented a solution for the 1-RCPP problem. In this section we extend
this solution to the T-RCPP problem. We start by extending matroids SM and TLM
to capture time-varying generation costs c(gi, tj) and generating point delays d(gi). This
extension is based on the concept of time expanded matroids [180].
5.5.4.1 Extended Ground Set
We start by extending the ground set G to the extended ground set EG.
EG ≜ {(gi, tj) | gi ∈ G, tj ∈ {0, · · · , T − 1}, d(gi) + tj ≤ T − 1}.
EG consists of all generating sources and their possible allocation times, represented
as a set of pairs (gi, tj) that satisfy the delay constraints. For the case study given in
Section 5.1.1, the extended ground set is: EG = {(g1, 0), · · · , (g8, 0), (g1, 1), (g2, 1), (g3, 1),
(g6, 1), (g8, 1), (g2, 2), (g6, 2)}.
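The construction of EG is a one-line comprehension over delays and the horizon. In the sketch below the delay values d(gi) are inferred so that the listed case-study EG is reproduced (sources with larger delays get fewer feasible slots); they are illustrative assumptions, not values quoted from the case study.

```python
# Build the extended ground set EG = {(g, t) : t in {0,...,T-1}, d(g)+t <= T-1}.
def extended_ground_set(delays, T):
    """Pairs (generator, start slot) that can complete within the horizon."""
    return [(g, t) for t in range(T)
            for g, d in delays.items() if d + t <= T - 1]

# Hypothetical delays, chosen to reproduce the EG listed above for T = 3.
delays = {"g1": 1, "g2": 0, "g3": 1, "g4": 2,
          "g5": 2, "g6": 0, "g7": 2, "g8": 1}
print(extended_ground_set(delays, T=3))  # reproduces the EG listed above
```

Note how a delay of 2 (e.g. g4) leaves only slot 0 feasible, while a delay of 0 (e.g. g2) allows all three slots.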
5.5.4.2 Policy Constraints
We extend SM, the matroid capturing the policy constraints, as follows. First, with each
(gi, tm) ∈ EG we associate a column vector zi. This is the same column vector zi
as that associated with gi in Section 5.5.1.1. We define EL1 = {EL1 | EL1 ⊆
EG such that all the (gi, tm) ∈ EL1 have linearly independent associated column vectors
for fixed time}. In other words, EL1 is the family of all subsets of EG whose associated
column vectors are linearly independent.
We define: ESM ≜ (EG, EL1).
For the case study given in Section 5.1.1, an example of a set EL1 belonging to EL1 is
EL1 = {(g1, 0), (g4, 0), (g7, 0)}, where the column vectors associated with (g1, 0), (g4, 0),
and (g7, 0) are 100, 010, and 002 respectively.
5.5.4.3 Transmission Line Constraints
We extend TLM, the matroid capturing the transmission line constraints, as follows.
Define EL2 = {EL2 | EL2 ⊆ ⋃_{j=0}^{T−1} EL2tj}, where each EL2tj ⊆ EG is such that
for each (gi, tm) ∈ EL2tj, d(gi) + tm = tj, and for all (gi, tm), (gr, tp) ∈ EL2tj, gi and gr
satisfy the line constraints.
Thus, each EL2tj contains the elements of EG which can use their required
transmission line at time tj without violating the line constraints.
We define: ETLM ≜ (EG, EL2).
For the case study given in Section 5.1.1, an example of a set EL2 belonging to EL2 is
EL2 = {(g3, 0), (g6, 0), (g4, 0)}. Note that (g3, 0), (g6, 0), and (g4, 0) all require the same
transmission line, that is, γ1, but they will actually couple to the transmission line at time
slots 1, 2, and 0 respectively, and therefore do not violate the transmission line constraint.
Lemma 35 Both ESM and ETLM are matroids. Furthermore, the optimal solution to
the T-RCPP problem is given by the least cost maximum cardinality common independent
set of matroids ESM and ETLM.
Proof: The fact that both ESM and ETLM are matroids follows directly
from the concept of time expanded matroids [180]. Then, using arguments similar to those
in the proof of Lemma 33, it can be shown that the intersection of ESM and ETLM
satisfies the policy, transmission line, and generation delay constraints, and that the
optimal solution to the T-RCPP problem is the least cost maximum cardinality common
independent set of ESM and ETLM.
A sketch of the solution for the T-RCPP problem, Algorithm T-RCPP, is shown in
Fig. 5.7. In Algorithm T-RCPP a generating source gi runs multiple copies of Algorithm
1-RCPP, one for each (gi, tj) ∈ EGgi, where EGgi represents the copies of gi at different
time slots. For the case study given in Section 5.1.1, the execution of Algorithm T-RCPP
schedules generating sources g3, g4 and g6, all at t = 0, for power procurement. The
sample execution of Algorithm T-RCPP for the case study given in Example 5.1.1 is
Algorithm T−RCPP()  Code for gi, 1 ≤ i ≤ n
1  Generate EGgi ≜ {(g, tj) | g = gi, tj ∈ {0, · · · , T − 1}, d(gi) + tj ≤ T − 1}
2  For each generating source (gi, tj) ∈ EGgi, assign a vector ID id(i,j) as follows:
3      id1(i,j) = zi
4      id2(i,j) = k such that gi ∈ xk
5      id3(i,j) = j + d(gi)
6  Broadcast id(i,j) for all (gi, tj) ∈ EGgi to all the other generating sources
7  For each (gi, tj) ∈ EGgi, call Algorithm 1-RCPP with the following modification in
   subroutine ChkL given in Fig. 5.6:
8      Step 8 is modified to "If id2k = id2j AND id3k = id3j"

Figure 5.7: Algorithm T−RCPP
shown in Figure 5.8. The Algorithm T-RCPP takes three rounds. The edges indicate
the valid replacements of the selected generating sources w.r.t. EL1 and EL2. Edges are
drawn according to the routing tables →E (incoming neighbors) and ←E (outgoing
neighbors). For a selected generating source gi, ←E(gj) = 1 (shown as an edge (gj, gi))
indicates that gj is a valid replacement of gi w.r.t. EL1, and →E(gj) = 1 (shown as an
edge (gi, gj)) indicates that gj is a valid replacement of gi w.r.t. EL2. The generating
sources that are selected in G1 and G2 are shown with circular and square boundaries
respectively. For a generating point g with Gon(g) = 0, the number on the left indicates
its cost, whereas the number on the right indicates the cost of the least cost path
originating from g (cm). For a generating source with Gon(g) = 1, its cost is shown on
its right. The generating source selected in the current round is shown in black (with
the corresponding cm shown in a gray circle). After the third round the final (optimal)
selection of the generating sources is also shown. The cost of the optimal solution is 12.
In all the rounds the shortest path is zero hops. In the first round, round1, generating
point (g3, 0) is selected, since for it both G1(g3, 0) = 1 and G2(g3, 0) = 1 and the cost of
the path from (g3, 0) to itself is the least among all the paths between any generating
point (gk, tl) with G1(gk, tl) = 1 and any generating point (gb, ty) with G2(gb, ty) = 1.
Similarly, in rounds round2 and round3 the generating sources (g6, 0) and (g4, 0) are
selected. Also note that all of the selected generating points require the same
transmission line but are not in conflict with each other, as they will be coupled in
different time slots due to the generation delays associated with them.
Theorem 36 The Algorithm T-RCPP(EG) provides an optimal solution to the T-Restricted
Constrained Power Procurement (T-RCPP) problem.
Proof: The correctness of the algorithm follows from Lemma 35 and the correctness of
the weighted matroid intersection algorithm of Edmonds [178].
Lemma 37 For Algorithm T-RCPP, the number of messages exchanged is M = O(T ℓ n²)
and the local time complexity is Tlocal = O(T ℓ n²).
Proof: The complexity of Algorithm T-RCPP follows directly from the complexity
of Algorithm 1-RCPP. The only difference is the increased number of generating points
(in the worst case, each generating point is copied once for each time slot), which is
precisely T times the actual number of generating points. Therefore the message and
local time complexity are each multiplied by a factor of T compared to Algorithm 1-RCPP.
5.6 Constrained Power Procurement Problem
In this section we show that the CPP problem is NP-hard by a reduction from the opti-
mization version of the independent set (IS) problem, i.e., the maximum independent set.
Before proceeding to the proof of NP-hardness, we give the conflict graph representation
of the transmission line constraints.
Definition 38 (Conflict Graph) A conflict graph Gc(V c, Ec) has vertex set defined as
V c = {g1, · · · , gn}, and edge set defined as Ec = {(gi, gj) | Wi + Wj > Lk, where gi, gj ∈ xk}.
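Definition 38 translates directly into code: for each line block xk, join two sources by an edge whenever their combined output exceeds that line's capacity. The blocks, capacities, and outputs below are hypothetical illustration data.

```python
from itertools import combinations

def conflict_graph(blocks, capacity, output):
    """Build the conflict graph G^c of Definition 38: vertices are the
    generating sources; an edge joins two sources on the same line block
    x_k whose combined output W_i + W_j exceeds the line capacity L_k."""
    vertices = sorted({g for xs in blocks.values() for g in xs})
    edges = set()
    for k, xs in blocks.items():
        for gi, gj in combinations(sorted(xs), 2):
            if output[gi] + output[gj] > capacity[k]:
                edges.add((gi, gj))
    return vertices, edges

# Hypothetical data: line 1 serves {g1, g2}, line 2 serves {g3, g4}.
blocks = {1: ["g1", "g2"], 2: ["g3", "g4"]}
capacity = {1: 10, 2: 10}
output = {"g1": 6, "g2": 7, "g3": 4, "g4": 5}
v, e = conflict_graph(blocks, capacity, output)
print(e)  # {('g1', 'g2')}: 6 + 7 > 10, while 4 + 5 <= 10
```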
Theorem 39 The Constrained Power Procurement problem is not only NP-hard, but
also NP-hard to approximate within a ratio of n^(1−ε) for any constant ε > 0.
Figure 5.8: A sample execution of Algorithm T-RCPP for the system shown in Example 5.1.1.
Proof: We first define an instance of the maximum independent set problem. An
instance IS of the maximum independent set problem is defined by a graph H(V ′, E ′),
with vertex set V ′ and edge set E ′. The objective is to find a maximum cardinality subset
of V ′ such that no two vertices in this subset are directly connected to each other. We
define the corresponding instance of the CPP problem as follows:
• G = V ′, and gi = v′i ∀i, i.e., each vertex v′i ∈ V ′ corresponds to a generating source gi.
• The policy constraint set is: PC(G) = {s1, · · · , s|V ′|}, where s1 = {g1}, · · · , s|V ′| =
{g|V ′|}.
• The conflict graph Gc(V c, Ec), capturing the transmission line constraints, is defined
as: V c = V ′, and Ec = {(gi, gj) | (v′i, v′j) ∈ E ′}. Thus, any two power sources (gi, gj)
cause an overload on the competing transmission line iff (v′i, v′j) is an edge in E ′.
• The planning horizon is defined by T − 1 = 0, that is, T = 1.
• For the delay constraints:
– d(gi) = 0 ∀i.
• c(gi, tj) = 1 ∀i, j.
Due to the one-to-one correspondence of graphs H(V ′, E ′) and Gc(V c, Ec), a maximum
independent set in H(V ′, E ′) is the same as a maximum independent set in Gc(V c, Ec).
Formally, an independent set in Gc(V c, Ec) corresponds to the generating sources that
can simultaneously be selected for power procurement, since they do not result in
transmission line overload. Furthermore, as each vertex in H(V ′, E ′) corresponds to a
facility area in CPP, finding the maximum cardinality independent set is equivalent to
maximizing the number of facility areas being served, and the cardinality of the maximum
independent set in Gc(V c, Ec) is equal to Σ_{si ∈ PC(G)} f(si). Finding the maximum
independent set is not only NP-hard but also NP-hard to approximate within a ratio of
n^(1−ε) for any constant ε > 0 [181]. Therefore the CPP problem is NP-hard to
approximate within a ratio of n^(1−ε) for any constant ε > 0.
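The reduction itself is a mechanical translation, which the following sketch makes explicit: an independent-set instance H(V′, E′) maps to a CPP instance whose conflict graph is a copy of H, with unit costs, zero delays, singleton policy sets, and horizon T = 1. The dictionary keys are an illustrative encoding invented here, not a format used elsewhere in the chapter.

```python
def is_instance_to_cpp(vertices, edges):
    """Map a maximum-independent-set instance H(V', E') to the
    corresponding CPP instance of Theorem 39."""
    return {
        "generators": list(vertices),            # g_i = v'_i
        "policy": [[v] for v in vertices],       # s_i = {g_i}
        "conflicts": [tuple(e) for e in edges],  # E^c mirrors E'
        "horizon": 1,                            # T = 1
        "delays": {v: 0 for v in vertices},      # d(g_i) = 0
        "costs": {v: 1 for v in vertices},       # c(g_i, t_j) = 1
    }

cpp = is_instance_to_cpp(["v1", "v2", "v3"], [("v1", "v2")])
print(cpp["conflicts"])  # [('v1', 'v2')]
```

Any algorithm serving k facilities in this CPP instance selects k mutually non-conflicting generators, i.e. an independent set of size k in H, which is what carries the hardness over.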
5.7 Performance Evaluation
5.7.1 An overview of relevant solution techniques
In this section we highlight the specific properties that make the proposed solution
technique a preferred choice. There are many ways to interpret the preference among
solutions, for instance the notions of feasibility, optimality, time complexity, and
restrictions on the objective function, if any. We discussed these important
characteristics in Section 5.1, which implies that choosing either a Linear Program or an
Integer Linear Program for solving the RCPP problem, proven in this chapter to be
polynomial time solvable, results in undesirable solution quality in terms of one or more
parameters compared to the presented matroid theory based solution. We therefore
compare our solution technique, in the subsequent section, to a greedy solution that has
desirable solution parameters similar to the proposed matroid theory based solution.
However, we point out that the other techniques do have merits in classical settings,
but they show weakness when we expand the classical problem to include the emerging
constraints.
5.7.2 Experimental Setup
To evaluate the performance of the proposed scheme for practical power systems, we
performed a simulation study using the IEEE 300 bus test system [182] to capture the
transmission line constraints. The IEEE 300 bus test system was developed by the IEEE
Test Systems Task Force in 1993, so it does not incorporate the smart grid scenario with
a large number of distributed generating sources. To incorporate a realistic setting with a smart
grid including a large number of generating points, we extended the number of generating
points in this system, which is a known practice (see for example [183]). Specifically, we
divided the given network into 30 distributed data center facilities. Note that different
load demands, operation of circuit breakers, manual operation of the isolators, maintenance
and failure patterns of different devices (transformers, generators, transmission lines,
buses, relays, etc.), and availability of renewable sources (such as power production from
wind turbines, solar cells, etc.) give rise to a large number of possible realizations of this
system. For each experiment we consider a realization of this system selected from all
possible realizations in an unbiased way. Each point in the graphs presented in this
section is an average taken over 100 experiments.
To supply a facility, at least one generating source must be able to supply its load
within the planning horizon, and the generating point should also be allocated a
transmission network with sufficient capacity to supply the power to the facility. All
the generating sources that share the same bottleneck link over the power transmission
network conflict with each other; this defines the Transmission Line Constraints. The
planning horizon is set to one time unit, and generation delays are set to zero, so the
Delay Constraints impose a schedule duration of one time unit (i.e., only one allocation
at time 0). Regarding Policy Constraints, we say a facility has been served if at least one
generating source is able to supply its load before the scheduling horizon elapses, and
the generating point is also allocated a transmission line to be able to supply the power
to the area.
The objective is to serve the maximum number of facilities such that the total
procurement cost is least, without violating the Transmission Line and Delay Constraints.
Throughout the rest of this section we shall use the word cost to mean the cost of
procurement.
5.7.3 Percentage procurement and Cost
In this set of experiments we use average percentage procurement, and average normalized
procurement cost as performance metrics.
For a given realization, the percentage procurement (α) is defined to be the percentage
of the facilities served without violating the constraints. For a given power network, the
procurement cost per facility (β) is defined to be the sum of the procurement costs of all
the selected generating sources divided by the number of facilities served.
For a given power network, the transmission network density (TND) is defined to
be the total number of transmission lines in the power network divided by the total
number of the facilities. Note that in a realization, there might be more transmission
lines available to serve one facility than another facility. So, TND shows on average, how
many transmission lines connect a facility to the generating sources. For a given power
network, the average procurement choices per facility (PCF ) is defined to be the total
number of generating sources divided by the total number of facilities.
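The four metrics above are simple ratios, summarized in the following sketch (the numeric inputs are hypothetical, chosen only to exercise the formulas):

```python
# Evaluation metrics of Section 5.7.3: percentage procurement (alpha),
# procurement cost per facility (beta), transmission network density (TND),
# and procurement choices per facility (PCF).
def metrics(n_facilities, n_served, selected_costs, n_lines, n_sources):
    alpha = 100.0 * n_served / n_facilities   # % of facilities served
    beta = sum(selected_costs) / n_served     # cost per served facility
    tnd = n_lines / n_facilities              # lines per facility
    pcf = n_sources / n_facilities            # generating sources per facility
    return alpha, beta, tnd, pcf

# Hypothetical realization: 30 facilities, 24 served by 4 selected sources.
a, b, t, p = metrics(n_facilities=30, n_served=24,
                     selected_costs=[5, 7, 6, 6], n_lines=900, n_sources=120)
print(a, b, t, p)  # 80.0 1.0 30.0 4.0
```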
The first set of experiments shows α and β as functions of TND. We consider three
different settings. Figures 5.9, 5.10, and 5.11 show α versus TND using Algorithm T-
RCPP and the greedy heuristic for different values of PCF. Figures 5.12, 5.13, and 5.14
show the corresponding β for the results in Figures 5.9, 5.10, and 5.11 respectively. The
results show that with a higher PCF, Algorithm T-RCPP achieves a higher α, whereas
the greedy heuristic does not show improvement even when PCF is increased. The
results also point out the large gap between the optimal solution (using Algorithm T-RCPP)
and the greedy solution in terms of both α and β. An interesting point to note is that
when PCF is below a certain threshold, even Algorithm T-RCPP, i.e., the optimal
solution, cannot provide 100% procurement. This follows from the fact that a lower PCF
results in more Transmission Line Conflicts. Note that where Algorithm T-RCPP can
Figure 5.9: Average Percentage procurement α for random power networks as a function of Transmission Network Density TND for PCF = 2.
Figure 5.10: Average Percentage procurement α for random power networks as a function of Transmission Network Density TND for PCF = 4.
Figure 5.11: Average Percentage procurement α for random power networks as a function of Transmission Network Density TND for PCF = 8.
provide up to 100% service, the greedy heuristic cannot provide more than 60% service.
It is important to note that the proposed Algorithm always guarantees a lower cost
solution compared to the greedy solution. Also, the difference between the proposed
solution and the greedy solution in terms of β is more significant when there are more
conflicts on the usage of transmission lines due to a smaller TND. Moreover, the
percentage procurement α for both the proposed solution and the greedy solution
increases as the number of available transmission lines per area increases. One of the
primary reasons for this is the fact that fewer transmission lines available to any service
area dictate more and more conflicts over the usage of transmission lines, resulting in
fewer areas being served even optimally. These results also highlight the significance of
the proposed solution for future power networks, where the density of generation sources
is envisioned to increase due to the increased penetration of distributed generation.
The second set of experiments shows the relationship of α and β versus the average
Figure 5.12: Average Procurement Cost per Facility β for random power networks as a function of Transmission Network Density TND for PCF = 2.
Figure 5.13: Average Procurement Cost per Facility β for random power networks as a function of Transmission Network Density TND for PCF = 4.
Figure 5.14: Average Procurement Cost per Facility β for random power networks as a function of Transmission Network Density TND for PCF = 8.
PCF (the average number of sources is gradually increased from 60 to 200). We consider
two different settings for TND. Figures 5.15 and 5.16 show the comparison of α versus
the average PCF, for power networks with TND equal to 30 and 45 respectively.
Figures 5.17 and 5.18 show the β corresponding to the results in Figures 5.15 and 5.16
respectively. The results show that Algorithm T-RCPP is able to provide 100%
procurement even with a small network density, which highlights the importance of the
proposed solution for achieving the best possible procurement in practical settings where
resources are usually scarce. Furthermore, having more resources (TND and PCF)
helps in reducing the overall procurement cost. The results also point out the large gap
between the optimal solution (using Algorithm T-RCPP) and the greedy heuristic in
terms of both α and β.
Figure 5.15: Average Percentage procurement α for random power networks as a function of Procurement Choices per Facility PCF for TND = 30.
Figure 5.16: Average Percentage procurement α for random power networks as a function of Procurement Choices per Facility PCF for TND = 45.
Figure 5.17: Average Procurement Cost per Facility β for random power networks as a function of Procurement Choices per Facility PCF for TND = 30.
Figure 5.18: Average Procurement Cost per Facility β for random power networks as a function of Procurement Choices per Facility PCF for TND = 45.
Note that the procurement cost for both the proposed solution and the greedy solution
decreases with the increase in PCF due to the availability of better and less
costly choices for powering up each facility. Also note that although the greedy solution
attempts to pick the least costly choice for each facility locally, it still cannot
compete with the globally optimal matroid based solution. The reason is that
choosing the least cost generating source for each facility might result in conflicts over
the usage of transmission lines and ultimately lead to poorer percentage procurement and
more expensive alternates to avoid transmission line overloads than the optimal
solution can guarantee. This large difference between the proposed solution and the
greedy solution in terms of both cost and percentage procurement reinforces the
significance of the proposed matroid based solution. Note that due to the mismatch in
the growth of the power network and distributed generation sources [136], the future of
power procurement might face more transmission line bottlenecks. Under these
developing constraints the proposed Algorithms shall not only help in increasing
percentage procurement but also in reducing the effective system cost by finding lower
cost alternatives while still meeting the policy-driven constraints.
5.8 Summary
In this chapter, we study how green initiatives and the cyber-physical evolution of power
systems can affect the complexity of optimizing power procurement for distributed data
center facilities, giving rise to opportunities to consider new formulations. In particular,
we present a general framework for modeling the constrained power procurement (CPP)
problem, where the distributed data center facilities exhibit policy constraints, the power
grid exhibits transmission line constraints, and delay constraints apply over a given
planning horizon to meet the forecasted demand of distributed facilities. We have
presented polynomial time distributed solutions to the 1-RCPP and T-RCPP problems
and shown the NP-hardness of the CPP problem.
Chapter 6
Concluding Remarks and Path
Forward
Data centers represent a backbone for big data processing, yet they have the potential to become processing bottlenecks and energy black holes. Unless data centers are carefully designed to be an efficient enterprise, we risk reversing the motivation of their inception. Therefore, it is imperative that big data paradigms holistically incorporate the challenges of greener optimization. A vital step towards this goal is to rethink the foundations of big data optimization, focusing on the existing infrastructure weaknesses that hinder the full realization of greener big data's potential. In Chapter 3 we identified the planes crucial for making the big data enterprise greener. Moreover, we presented an extensive survey of the ongoing efforts in each of the planes and highlighted the pressing challenges associated with each plane. Chapter 4 described a scheme to optimize big data processing in favor of greener server, virtualization, interconnect, and framework planes. Additionally, Chapter 5 presented a solution for optimizing the power procurement plane.
We argue that a true green big data ecosystem requires collaboration not only from
the core of big data enterprise—data centers—but also from other parts of the ecosystem
including, but not limited to, end users, standard development bodies, regulatory bodies,
equipment manufacturers and suppliers, network operators, and content providers. Accordingly, we present some open problems and challenges in this chapter, and conclude by proposing a novel, futuristic green orchestrator focusing on a comprehensive view of the ecosystem.
6.1 Open Issues and Future Work
Traffic profiling, measurement, and analysis is a fundamental step toward realizing an
energy-efficient network infrastructure. Since the traffic profile is strongly coupled to the types of applications and services, application-specific, traffic-forecasting-based decisions can provide better and more efficient energy management.
Furthermore, the interconnect plane associated with cloud-based data centers can neither be restricted to the conventional tree topology nor confined to high-bisection topologies like VL2, Fat-Tree, and Jellyfish [60, 83, 184]. Rather, the interconnect plane of a cloud-based service can be a combination of many different topologies at distant sites, perhaps connected via the Internet. Therefore, restructuring existing solutions and techniques to suit the dynamics of the more general interconnects associated with cloud paradigms is crucial for greener services.
Dynamics of data center networks are shifting towards adopting energy-efficient interconnects and protocols like IEEE 802.3az [185], but in certain scenarios this can degrade
the quality of the application. A key challenge in this regard is to develop solutions that
can achieve a good trade-off between energy efficiency and performance metrics.
Furthermore, the impact of an energy-efficient server plane on Quality-of-Experience (QoE) and Service-Level Agreements (SLAs) is also fundamental, and can help bridge the gap between theory and business practices.
It is necessary to investigate application-specific designs that can improve IT equipment utilization and efficiency in a transparent and seamless fashion. Optimization
solutions that exploit the correlation among large data sets can help to reduce the use of
IT resources required by the frameworks. Furthermore, there are several ways in which
software-defined networking can help in bringing about the necessary changes to make
big data processing energy efficient.
There is also a need to address tool-specific energy inefficiency in big data processing. A plausible direction is the development of a DPM controller specific to a tool, e.g., DPM for Hadoop.
Revolutionary models advocating nano data centers are gaining attention. Nano data centers are peer-to-peer distributed data centers that use ISP-controlled home gateways as both computing and storage resources. Nano data centers have been shown to be an energy-efficient alternative to conventional data centers. Exploring the interplay of the frameworks with nano data centers would be very beneficial.
The business models for many of today’s multi-tenant cloud data centers are based
on selling computing resources to remote clients, e.g., Amazon EC2. Data centers in
such scenarios do not have advance statistics about the jobs they run. Hence there is little to no control over effectively scheduling jobs or deploying DPM-based techniques to improve power efficiency. Furthermore, DVFS-based solutions to improve power efficiency are not very viable in today's multi-tenant cloud-based data centers, since these might violate the SLAs where a tenant is paying for a computing resource with a specific operating frequency. Therefore, there is a need to study the impact of power management solutions on SLAs to appropriately update agreements while still managing power efficiently.
Moreover, there is a need to empirically study the effectiveness of various energy efficiency metrics to better understand the energy usage dynamics of multi-tenant enterprises. The business models of big data enterprises need to monitor and report appropriate energy efficiency metrics, thus encouraging the responsible use of resources and services [6].
A major challenge for integrating renewables into the energy portfolios of data centers is the unpredictability of weather-dependent renewable sources (e.g., solar energy, wind energy). Specifically, large-scale solar power deployment is still far from reality because a large number of solar panels is required to produce even a tiny fraction of the energy required by a data center. For example, according to Emerson Network Power, it takes seven acres of solar panels to generate 1 MW [186].
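The scale of the problem is easy to quantify from the Emerson figure: at seven acres per megawatt, even a modest facility needs a large footprint. The 20 MW demand below is an assumed, illustrative figure, not one taken from the text:

```python
ACRES_PER_MW = 7.0  # Emerson Network Power estimate cited above

def solar_acreage(demand_mw):
    """Acres of solar panels needed to cover a given average demand (MW)."""
    return demand_mw * ACRES_PER_MW

# A hypothetical mid-size 20 MW facility would need 140 acres of panels.
acres_20mw = solar_acreage(20)
```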
It is also of fundamental importance to empirically study the impact of integrating
greener energy on QoS and SLAs, and develop appropriate schemes.
With the increasing popularity of cloud-related services, it is vital to explore the applicability of existing techniques for energy profile management to multi-tenant data centers. Moreover, there is a need to devise new demand response strategies for multi-tenant data centers that incorporate looser controls over customer-driven workloads. On the utility side, there is a need to explore mechanisms for utility-based incentives offered to multi-tenant cloud data centers. Similarly, the realization of revolutionary nano data centers calls for special business paradigms and stimulus-based rewards for all stakeholders [187].
Furthermore, there is a need to include more comprehensive system constraints in the presence of data uncertainty, reliability, and security issues. Studying the impact of storage systems and renewable power generators, while incorporating a detailed analysis of the parameters specific to the nature of the generating source, is another thrust. Moreover, there is a need to develop algorithms for real-time and online scheduling environments that take into account the effects of communication system performance as well as swiftly changing market prices.
6.2 Green Orchestrator
The significance of efforts in the individual planes of a big data enterprise is indisputable; however, we assert that an orchestrator that takes a cross-plane approach to energy efficiency in a holistic fashion is very important. Among other reasons, such a cross-plane orchestrator is needed to make sure that the solution for one plane is complemented by solutions for other planes towards the common goal of establishing a greener system. For instance, a solution technique that optimizes the virtualization plane with respect to energy efficiency might not be well supported by the insufficient capacity of the interconnect plane. More specifically, a scheme that relies on VM migrations for optimizing energy efficiency needs more bandwidth and might stress an interconnect plane that is already under pressure. Similarly, APIs developed to facilitate energy efficiency, but without fully respecting the underlying hardware or system, can result in redundant resource utilization. Likewise, the generation mix for the energy portfolio of a batched job can be much more flexible in incorporating greener and renewable sources by mitigating the uncertainties associated with renewables, e.g., by scheduling tasks in the batch, whereas such flexibility might not be viable for a real-time workload.
We recognize that system requirements are highly specific to workloads and applications, and therefore urge that developing a cross-plane integrated solution for such a huge system be done on a per-job basis and in accordance with the type of framework. In this section we propose a novel per-job cross-plane orchestrator that can be implemented to improve energy efficiency. The proposed cross-plane Green Orchestrator consists of the following components.
• Green Lessor: This component is responsible for establishing a per-job SLA based on a collaborative lease among the stakeholders. The stakeholders include the data center, tenant, and energy provider (both utility and local generation). The lease takes into account available power, pricing models, power consumption statistics
Figure 6.1: Green Orchestrator for big data processing.
of jobs, network state, and server state, while incorporating the predictability of the available local energy harvest. The green lessor takes input from the network state predictor, server state predictor, and post-execution analyzer.
• Pre-Execution Analyzer: This component performs pre-execution analysis of a job. Based on the job statistics and history logs from the post-execution analyzer, the pre-execution analyzer identifies the power statistics and requirements while incorporating the lease SLAs.
• Network State Predictor: This component predicts the network state. Specifically, the network state predictor takes into account inbound and outbound queues at the switches and routers, available bandwidth, and throughput. The network state predictor interacts with the network dynamic power manager to keep the predictions updated.
• Server State Predictor: This component predicts the state of the server and compute nodes in terms of the CPU and memory resources available for a job. The server state predictor interacts with the server dynamic power manager to keep the predictions updated.
• Network Traffic Optimizer: This component focuses on optimizing the network by performing traffic engineering. Specifically, it eliminates redundant traffic and reshapes the flows. The network traffic optimizer interacts with the network and server dynamic power managers to optimize the traffic.
This component is very useful in distributed big data applications where inter-process communication is not scheduled for optimal energy consumption. For example, within a Hadoop job, the shuffle phase does not take energy conservation into account when initiating communication between map and reduce processes. The shuffle process can be made greener if the amortized cost of energy is taken into account while scheduling inter-process communication. Specifically, the flows in the shuffle phase can be scheduled collaboratively with map and reduce processes to minimize the number of under-utilized or idle servers. DPM techniques can then be used to put a set of servers into a sleep or low-energy mode.
• VMizer: This component is in charge of VM placement, provisioning, and scheduling. VMizer works in coordination with the network state predictor and server state predictor. The overall objective of this component is the intelligent placement of VMs while amalgamating them onto a subset of the cluster. This improves system utilization by allowing a greater number of VMs per node (scaled according to the CPU and storage capacity) and putting the rest of the nodes in sleep mode. For example, concentrating mappers and reducers in Hadoop, or spouts and bolts in Storm, such that each active node in the cluster is fully utilized, rather than having a large number of partially utilized yet active nodes. This not only reduces the number of active nodes in the cluster, but also concentrates the traffic in a limited portion of the cluster, giving an opportunity to put unused network devices into sleep mode.
• Pizer: This component is in charge of process placement, provisioning, and scheduling. Pizer facilitates the intelligent placement of processes (e.g., mappers and reducers in Hadoop, spouts and bolts in Storm) on a subset of the cluster with the aim of improving resource utilization. For example, in Hadoop, pizer places map and reduce processes that need to communicate with each other in close proximity. Pizer places the processes on VMs while respecting locality and bandwidth limitations to improve turnaround time.
• Server-Dynamic Power Manager: Server-DPM performs server level dynamic
power management and works in coordination with server state predictor.
• Network-Dynamic Power Manager: Network-DPM performs network level dy-
namic power management and works in coordination with network state predictor.
• Post-Execution Analyzer: This component analyzes a job’s energy profile upon
completion.
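The component interactions described above can be sketched as a minimal per-job admission loop. All class names, methods, and numbers below are hypothetical placeholders, since the orchestrator is specified here only at the conceptual level:

```python
# Minimal sketch of the proposed orchestrator's per-job decision flow:
# predictors report capacity, and the green lessor admits a job only if
# both server and network planes can accommodate it. Illustrative only.
class NetworkStatePredictor:
    def predict(self):
        # Assumed, static prediction; a real predictor would model queues,
        # bandwidth, and throughput as described in the text.
        return {"available_bandwidth_gbps": 40}

class ServerStatePredictor:
    def predict(self):
        # Assumed, static prediction of idle CPU resources.
        return {"idle_cpus": 64}

class GreenLessor:
    """Combines predictor outputs into a per-job admission/SLA decision."""
    def __init__(self, net, srv):
        self.net, self.srv = net, srv

    def lease(self, job_cpu_need, job_bw_need_gbps):
        net, srv = self.net.predict(), self.srv.predict()
        ok = (srv["idle_cpus"] >= job_cpu_need
              and net["available_bandwidth_gbps"] >= job_bw_need_gbps)
        return {"admitted": ok}

lessor = GreenLessor(NetworkStatePredictor(), ServerStatePredictor())
decision = lessor.lease(job_cpu_need=32, job_bw_need_gbps=10)
```

A real implementation would of course close the loop with the DPM components and the post-execution analyzer rather than return a static decision.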
The synergy of different components of the proposed Green Orchestrator is shown in
Figure 6.1. Several solutions discussed in Chapter 3 can be used to implement some of
the proposed components. However, some components need more groundbreaking work, specifically the green lessor, network traffic optimizer, and pizer.
We believe that the discussion of the different components of the proposed orchestrator can help fill in the gaps in efforts addressing appropriate traffic engineering, business models, energy portfolio management, and process-level energy planning. Furthermore, we expect this work to spur development efforts that realize and demonstrate the suitability and practicality of the proposed orchestrator in reviving the green vision of the big data ecosystem.
Appendices
Appendix A
Energy Savvy Planes for Big Data
Enterprise: Literature Survey and
Discussions
This appendix surveys several key approaches to transform the 5Ps of the big data enterprise, identified in Chapter 3, into greener planes.
A.1 Server Plane
This section presents some of the key approaches to make the server plane energy savvy.
The notion of energy proportionality, i.e., energy consumption proportional to the
workload [127], is at the heart of achieving energy efficiency for underutilized servers.
Dynamic Power Management (DPM) has been advocated to be very effective for
directly addressing power inefficiency caused by underutilized servers. DPM adjusts the
power consumption of a device by taking into account the workload prediction [188–190].
One of the best-known techniques for DPM is Dynamic Voltage and Frequency Scaling (DVFS) [191–195]. The basic idea behind DVFS is to save energy by using a low supply voltage and a low clock frequency when the workload is not computationally intensive, or there is no need to operate a processor at maximum performance.
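The energy saving promised by DVFS follows from the classic CMOS dynamic-power relation P = C·V²·f; halving both supply voltage and clock frequency cuts dynamic power by a factor of eight (at the cost of longer execution time). A minimal numeric sketch with normalized, illustrative units:

```python
def dynamic_power(c_eff, voltage, frequency):
    """Classic CMOS dynamic-power model P = C * V^2 * f underlying DVFS.
    Units are normalized; this is an illustration, not a processor model."""
    return c_eff * voltage**2 * frequency

full = dynamic_power(1.0, 1.0, 1.0)
# Halving both supply voltage and clock frequency during light load
# reduces dynamic power to (1/2)^2 * (1/2) = 1/8 of the full value.
scaled = dynamic_power(1.0, 0.5, 0.5)
```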
Server consolidation is another approach to DPM for underutilized servers. Server consolidation-based techniques achieve energy efficiency by consolidating jobs onto a limited number of highly utilized servers, and then transitioning the rest of the servers to low-power or off states [196–200].
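Server consolidation is, at its core, a bin-packing problem. The sketch below uses first-fit decreasing, a common heuristic rather than any specific scheme from [196–200], to show how lightly loaded jobs can be packed onto a few servers so that the remainder can sleep:

```python
def consolidate_jobs(loads, server_capacity=1.0):
    """First-fit-decreasing consolidation: pack job loads (fractions of one
    server's capacity) onto as few servers as possible so idle servers can
    transition to a low-power state. Illustrative heuristic only."""
    servers = []  # residual capacity of each powered-on server
    for load in sorted(loads, reverse=True):
        for i, free in enumerate(servers):
            if load <= free:
                servers[i] = free - load
                break
        else:
            # No existing server fits this job: power on another one.
            servers.append(server_capacity - load)
    return len(servers)

# Ten jobs at 25% load each collapse from ten lightly loaded servers
# onto three highly utilized ones; the other seven can sleep.
powered_on = consolidate_jobs([0.25] * 10)
```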
With the expanding big data enterprise, the heterogeneity of computing and storage equipment is increasing and will likely impact power management solutions. Specifically, energy usage characteristics are highly dependent on the underlying platform, and any DPM solution that ignores this heterogeneity will fall short. Therefore, a practically viable yet energy-efficient solution must respect heterogeneity. Some scheduling techniques exploit the heterogeneity of IT resources in data center environments by maximizing the use of energy-efficient components and minimizing the use of energy-inefficient components by means of workload scheduling [201, 202].
Job scheduling is another approach to achieving higher server utilization by better
workload management with a focus on energy conservation [203–205].
Another approach to turning the server plane into a greener plane is to explore the effects of hardware characteristics on energy efficiency. In this regard, an intriguing study was conducted in [206], where the authors compared the energy characteristics of several small clusters of embedded processors, mobile processors, desktop processors, and server processors running big data applications. The interesting finding is that a data center cluster made up of high-end mobile processors and NAND-flash-based solid-state drives is the most energy-efficient system for data-intensive applications.
A.2 Virtualization Plane
Virtualization is a primary approach adopted for improving resource utilization at servers.
Despite its apparent usefulness for energy efficiency and broad penetration in data centers, the actual deployment of virtualized servers is not yet prevalent; for example, a 2010 survey reported that only 30% of production servers actually deploy virtualization techniques [207].
Server virtualization under dynamic workloads results in a huge increase in network traffic. This increase is a serious hurdle to the wide deployment of virtualization environments due to the limited available bandwidth for east-west traffic (traffic between different servers within a data center). One way to tackle this issue is new data center architectures that provide a higher rate of information exchange for east-west traffic [60, 83, 184, 208, 209]. Another direction is to perform live migration of VMs between physical servers that might be running different hypervisors. Some tools like [210] allow offline migration of VMs between different hypervisors.
A.3 Interconnect Plane
Network links in data centers are poorly utilized. According to one study, network links operate at 5%–25% of their capacity on average [30], and a significant number of network links remain idle 70% of the time [211], leaving substantial room for improving energy efficiency. With the increase in network-intensive distributed workloads for big data processing, the need to rethink energy efficiency for data center networks has never been more pressing.
In the context of energy efficiency, the conventional three-tier tree topology is known to be energy inefficient, partly due to the use of enterprise-level power-hungry equipment [212]. For large data centers, the switches at the top layer of a conventional three-tier topology are reported to consume huge amounts of energy [83]. The energy inefficiency of conventional data center topologies has led to the proposal of new energy-efficient topologies like VL2 [60], Fat-Tree [83], and Jellyfish [184] that use low-cost, energy-efficient commodity network equipment. The basic idea in these proposals is to use a densely connected interconnect with a combination of commodity off-the-shelf switches, flat addressing, and efficient load balancing. In particular, Fat-Tree reported an energy efficiency improvement of 56.6% compared to conventional data center networks [83].
Based on the Fat-Tree architecture, the authors in [51] proposed a network-wide power management scheme (DPM for network components). The core idea is to consolidate traffic flows onto a subset of links and switches. As a consequence of this consolidation, the unused network components can be switched off, resulting in significant power savings.
Another aspect of bringing energy efficiency to data center networks is the provision of load-proportional energy consumption in links and switches. In such a network device, the energy consumption depends on the amount of data flowing through it [26–30]. For example, the authors in [26, 28, 29] have used adaptive link rates combined with load-proportional energy-consuming network devices to reduce power consumption. Another approach is to combine both traffic consolidation and adaptive link rates for switches to improve power savings [213]. An extension of DVFS focused on scaling the voltage and frequency of the switch core and interconnect links has also been shown to address the energy inefficiency caused by low utilization of network links [214].
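An adaptive-link-rate policy can be sketched as stepping a link down to the lowest standard Ethernet rate that still carries its offered load. The power figures below are assumed, order-of-magnitude values, not measurements from the cited studies:

```python
# Illustrative adaptive-link-rate policy. Power per rate is assumed;
# real values depend on the PHY and are reported in the cited literature.
RATE_POWER_W = {10: 0.1, 100: 0.4, 1000: 1.0, 10000: 5.0}  # Mbps -> watts

def pick_rate(load_mbps):
    """Lowest supported rate that still accommodates the offered load."""
    for rate in sorted(RATE_POWER_W):
        if load_mbps <= rate:
            return rate
    return max(RATE_POWER_W)  # saturated: run at the top rate

def link_power(load_mbps):
    return RATE_POWER_W[pick_rate(load_mbps)]

# A 1 Gbps link loaded at only 5% (50 Mbps) can drop to the 100 Mbps rate,
# consuming 0.4 W instead of 1.0 W under this toy model.
low_load_power = link_power(50)
```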
Another avenue for addressing the energy challenges of the interconnect plane is to test the use of unorthodox and novel technologies to replace the conventional wired interconnect fabric and routing operations. For example, it has been shown that the use of wireless networks can improve energy efficiency in data centers, but this idea is still in the early stages of development [215–217].
Furthermore, software-defined networking (SDN) based traffic engineering with respect to virtualization techniques achieves better network resource utilization through dynamic resource allocation [218]. SDN, specifically OpenFlow, has also been used to demonstrate the integrated adaptation of computing and network resources with the aim of minimizing the carbon emissions of a data center [219].
A.4 Frameworks Plane
The main driving force for the adoption of cloud data centers is rooted in improved
application performance and higher operational efficiency for big data frameworks and
services. The core of major big data frameworks is to divide huge data sets into smaller
subsets, process them at separate nodes, communicate the processed information over
the network connecting different nodes (also referred to as the communication phase),
and then combine the results in a time-efficient manner.
The energy efficiency of big data frameworks can be improved by eliminating the idle time of IT equipment. An efficient way to reduce idle time is to shorten the communication phase of distributed applications, since a compute node keeps consuming energy while waiting to receive data spread across several nodes. In fact, it has been shown that bringing down job completion times improves energy efficiency. It is interesting to note that for some common big data frameworks, about 50% of the job completion time can be consumed in the communication phase (the phase in which data is transferred across servers over the data center network) [57]. The communication phase can be optimized, hence decreasing job completion time, by ameliorating network effects and taming workloads that are disruptive (in terms of communication intensity) by managing flows [57, 68, 69], and by using demand-driven path alterations [70–72]. Task, resource, and computation scheduling are some other ways to bring down job completion time [65–67].
Hadoop's energy efficiency has been a subject of considerable interest, as it is the most widely used framework for big data analytics. An interesting effort to make Hadoop greener is the integration of solar energy into a Hadoop cluster. This integration is achieved through intelligent job scheduling that takes into account the solar energy forecast [220]. Some other approaches to optimizing Hadoop include high-level shifting of VMs, mitigation of congestion incidents using combiners, and low-level continuous runtime network optimization in coordination with data-intensive application orchestrators [42, 57, 221]. Similarly, information-theoretic similarities in chunks of files spread across different nodes can be used to reduce energy consumption by reducing network traffic for Hadoop.
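One standard building block for exploiting such similarity is fingerprint-based redundancy elimination: chunk the data, hash each chunk, and skip chunks the receiver already holds. The sketch below uses naive fixed-size chunking for illustration, not the specific information-theoretic scheme alluded to above:

```python
import hashlib

def chunk_fingerprints(data, chunk_size=4):
    """Fixed-size chunking with SHA-256 fingerprints, a standard
    redundancy-elimination building block (illustrative parameters)."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def bytes_to_send(data, receiver_has, chunk_size=4):
    """Count only the bytes of chunks whose fingerprints the receiver
    does not already hold; duplicates never cross the network."""
    sent = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        if hashlib.sha256(chunk).hexdigest() not in receiver_has:
            sent += len(chunk)
    return sent

data = b"AAAABBBBAAAACCCC"
receiver = set(chunk_fingerprints(b"AAAA"))   # receiver already stores "AAAA"
saved = len(data) - bytes_to_send(data, receiver)  # both "AAAA" chunks skipped
```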
A.5 Power Procurement Plane
This section surveys efforts to optimize the power procurement plane, including power sourcing, appropriate business models, and greener portfolio management. It has been pointed out that the popularity of multi-tenant data centers is going to be an important contributing factor in industry-wide inefficient usage of energy. Lack of proper pricing models and conflicting priorities are two major reasons for the inefficient use of energy. More specifically, the priorities at the top of the list for multi-tenant data center providers generally include high levels of availability, reliability, privacy, security, and affordability [6]. The inherent conflict of these priorities with energy efficiency goals is obvious, and partly explains the paucity of energy initiatives in current business models. Furthermore, customers are not monetarily motivated to use energy-efficient infrastructures and frameworks due to the lack of appropriate lease agreements that charge customers in proportion to their energy usage. To address this concern, there has been growing interest among data center service providers in an environmental chargeback model that links the tenants to the indirect costs associated with energy usage [222]. Practical implementations of such chargeback models are being adopted by Microsoft data centers, where customers are charged for power usage [223].
The reforms in chargeback models are becoming more complex with the rise of cloud infrastructure, where the main theme is to abstract services from the IT infrastructure. Adding to the complexity is the virtualization of shared resources that can be acquired and released on demand. Such chargeback models should incorporate flexible pricing for virtualized environments. In this context, Cloud Cruiser is an initiative to integrate chargeback models with cloud infrastructure and decision analytics, e.g., Cloud Cruiser for Amazon Web Services (AWS) [224]. There is a need to extend such efforts to integrate energy prospects, in particular to develop special per-kWh chargeback models that suit the needs of multi-tenant shared enterprises [225]. A possible solution is the inclusion of a "post-paid" component in the chargeback models, reflecting energy-related costs for IT resources at the actual time of use.
Another strategy that can help businesses gear up for energy-aware practices is measuring and reporting energy consumption metrics. In this context, the Green Grid Association (TGG) has proposed measuring a data center's energy efficiency using Power Usage Effectiveness (PUE) [226]. PUE is defined as the total energy used by a facility divided by the energy used by its IT equipment. However, it has been noted that PUE often does not take into account all components of the infrastructure, but rather focuses on one particular segment of a data center, and therefore does not adequately reflect its efficiency [227]. In this vein, there have been efforts to redefine metrics to capture energy inefficiency in a better and more concrete fashion, such as Power to Performance Effectiveness (PPE), Data Center Productivity (DCeP), Data Center Compute Efficiency (DCcE), and Data Center Infrastructure Efficiency (DCiE), to name a few [6, 227].
Under coincident peak pricing (CPP) programs, a customer's electricity bill is not just scaled according to usage (the amount of electricity consumed) but also reflects peak demand and coincident peak demand [142]. The peak demand charge is based on the hour of the month in which the consumer's electricity consumption was highest. It has been reported that the peak demand charge is the major component of the electricity bills for all six Google data centers in the U.S. [141]. Similarly, the coincident peak demand charge is based on the hour of the month when the utility received the highest total demand from all of its customers. The peak demand charge is several times higher than the usage charge, and the coincident peak demand charge is in turn several times higher than the peak demand charge. Consumers are typically warned of an upcoming coincident peak hour somewhere between 5 minutes and 24 hours ahead of time [142].
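The structure of such a bill can be sketched with a toy calculation in which the peak and coincident-peak terms dwarf the usage term. All rates and loads below are invented for illustration, not taken from [141, 142]:

```python
def monthly_bill(hourly_kwh, usage_rate, peak_rate, cp_rate, cp_hour):
    """Toy coincident-peak bill: a usage charge on total consumption, a
    charge on the customer's own peak hour, and a much larger charge on
    consumption during the utility's coincident-peak hour."""
    usage = sum(hourly_kwh) * usage_rate
    peak = max(hourly_kwh) * peak_rate
    coincident = hourly_kwh[cp_hour] * cp_rate
    return usage + peak + coincident

load = [100, 100, 300, 100]  # kWh over four sample hours; hour 2 is the peak
bill = monthly_bill(load, usage_rate=0.05, peak_rate=1.0,
                    cp_rate=5.0, cp_hour=2)
# usage = 30, peak charge = 300, coincident-peak charge = 1500:
# the demand charges dominate, which is why shaving the peak pays off.
```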
Demand response is defined as altering the electricity usage pattern to reflect the
change in the price of electricity at a specific time. Demand response plans also include
incentives to encourage lower electricity consumption at times of high demand. Demand
response plans can help in shifting loads from coincident peak hours as well as reducing
the peak load charges.
Due to the unpredictability associated with the smooth integration of renewable energy into the power grid, demand response by consumers is crucial both for suppliers (helping them increase the share of renewables) and consumers (smart use of electricity for lower electricity bills). Since the booming big data enterprise has made data centers one of the most rapidly expanding electricity-consuming sectors in developed countries [228], an intelligent demand response from data centers is of great significance for green energy portfolios.
In the context of demand response, the authors in [142] presented schemes to scale down the peak load as well as the electricity expenditures of data centers through a combination of peak load shifting and the use of local power generation. Similarly, the authors in [141] presented a scheduling scheme for partial execution of data center workloads to reduce peak demand charges.
Along with appropriate demand response strategies, greening data centers requires special incentives offered by power suppliers. Such incentive programs are offered by Toronto Hydro-Electric System Ltd. [229] and PowerStream [230]. Under these programs, data center operators receive up to $800 for every kilowatt of peak reduction, or $0.10 per kWh of energy savings, for greening their data centers [231].
Integration of renewables is yet another factor that can improve the energy profiles of
data centers. Experimental studies in [134,232,233] have shown that integrating renew-
able energy sources into data center power supplies can help to reduce energy-related costs
while reducing carbon footprints. Some of the real-world examples of wind-powered data
centers include Other World Computing [234], Baryonyx Corp. [235], and Green House
Data [236]. Similarly, Microsoft’s Virtual Earth mapping servers in Colorado are housed
in 100% wind-powered Verari container units [237].
Appendix B
Proof: A Coding Based
Optimization for Big Data
Processing
We start by defining an instance of the index coding problem [97].
Definition 40 (Index Coding) An instance of index coding problem is defined by a
relay R which contains a set of k packets X = x1, · · · , xk that are to be delivered
to a set of m clients c1, · · · , cm over a broadcast channel CHL. Each client ci has
access to some side information HICi ⊆ X, and requires a set packets WICi ⊆ X from
the relay R. The relay R can transmit packets in X or their combinations (encoding).
The objective is to find a scheme that requires the minimum number of transmissions
from the relay R, and satisfies the requests of all the clients. Let OPT (IC) denote the
optimal solution of index coding problem, and |OPT (IC)| denote the minimum number
of transmissions from the relay.
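To make Definition 40 concrete, the following sketch works through a hypothetical three-packet instance (the packet values and helper names are illustrative, not from the thesis): each client wants one packet and holds the other two as side information, so one XOR broadcast replaces three uncoded transmissions, i.e., |OPT(IC)| = 1.

```python
# A minimal index coding sketch (hypothetical instance): relay holds
# X = {x1, x2, x3}; three clients each want one packet and hold the other two
# as side information, so a single XOR transmission satisfies everyone.

def encode(packets):
    """Relay broadcasts the bitwise XOR of all packets (one transmission)."""
    out = 0
    for p in packets:
        out ^= p
    return out

def decode(coded, side_info):
    """A client XORs out its side information to recover its wanted packet."""
    for p in side_info:
        coded ^= p
    return coded

X = [0b1010, 0b0110, 0b1111]           # x1, x2, x3 as small bit patterns
broadcast = encode(X)                   # |OPT(IC)| = 1 here vs. 3 uncoded sends

# client c1 wants x1 and knows {x2, x3}; c3 wants x3 and knows {x1, x2}
assert decode(broadcast, [X[1], X[2]]) == X[0]
assert decode(broadcast, [X[0], X[1]]) == X[2]
```

Without the side information, the relay would need k = 3 separate transmissions, which is the uncoded (routing) baseline used throughout the proofs below.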
B.1 Proof of Theorem 14
Given an instance of the index coding problem as in Definition 40, we define an instance of
the spate coding problem as follows:
• V = {AS, node1, S},
• E = {(AS, S), (S, node1)},
• X1 = AS,
• X = node1; note that node1 hosts all of c_1, ..., c_m,
• X2 = S,
• P = {x_1, x_2, ..., x_k},
• W_i = W_i^IC, for i = 1, ..., m,
• H_i = H_i^IC, for i = 1, ..., m.
Let |P(AS)| denote the number of link-level packet transmissions "made" by the AS
employing spate coding in the optimal scheme S. We show that |OPT(IC)| = |P(AS)|.
We start by analyzing the transmissions in the network N defined by the graph with
the vertex set V and the edge set E. Note that for all the packets transmitted by the AS
destined to any of the clients c_1, ..., c_m hosted on node1, OPT(S) requires the AS to choose
the optimal solution that minimizes the number of packet transmissions on the edge
(AS, S). The AS transmits the minimum number of packets by employing spate coding. It is
easy to see that the relay R can use the same scheme as the AS to minimize its number
of transmissions, which shall be exactly the same as the number of packet transmissions by
the AS on the edge (AS, S).
B.2 Proof of Theorem 15
The proof follows from Theorem 14 and the existence of an instance of the index coding
problem for each instance of the network coding problem [108].
B.3 Proof of Theorem 21
We prove that Problem ES is NP-hard by reducing the index coding problem to it. It
has already been shown that the index coding problem is NP-hard, and NP-hard to
approximate [238].
Given an instance of index coding problem as in Definition 40, we define an instance
of the Problem ES as follows:
We construct a typical data center network N following a tree topology consisting of
two top-of-rack switches S1 and S2, and an L2-aggregation switch AS, as shown in
Figure B.1. There are a total of T nodes node1, ..., nodeT in the network N, where T =
k + m. Nodes node1, ..., nodek are connected to S1, and nodek+1, ..., nodeT are connected to S2.
Each node node_i ∈ {node1, ..., nodek} hosts only a sender service instance. Moreover,
the service instance running on node_i ∈ {node1, ..., nodek} emits the packet x_i. Let the
set U represent the collection of all the packets emitted by the nodes node1, ..., nodek.
Each node node_j ∈ {nodek+1, ..., nodeT} hosts two service instances, a sender service
instance S_o and a receiver service instance S_r. The service instance S_o running on
node_{k+j} emits the set of packets given by H_j^IC. The service instance S_r running on
node_{k+j} is responsible for processing the set of packets W_j^IC ⊆ U, where j = 1, ..., m.
As shown in Figure B.1, any node node_i ∈ {node1, ..., nodek} can communicate with any
node node_j ∈ {nodek+1, ..., nodeT} via the AS. Nodes connected to the same top-of-rack switch can
communicate with each other without going through the AS. For all the packets passing via the AS,
it can either transmit their combinations (encoding) or send them uncoded (routing).
Figure B.1: An instance of the Problem ES.
Let |P(AS)| denote the number of link-level packet transmissions "made" by the AS,
as shown in Figure B.1, in the optimal scheme S. We show that |OPT(IC)| = |P(AS)|.
We start by analyzing the link-level transmissions in the network N. First, note that
the nodes in the set {node1, ..., nodek} do not receive any packet during the
communication phase of the distributed data center application, as they do not host any
service instance that intends to receive a packet. Furthermore, all the packets in the set
U have to pass through the AS, since for each packet in U there is a corresponding
receiver service instance hosted on one of the nodes nodek+1, ..., nodeT. Second, note
that for all the packets going through the AS destined to a service instance running on any
of the nodes nodek+1, ..., nodeT, OPT(S) requires the AS to choose the optimal scheme
that minimizes the number of link-level packet transmissions on the link CHL. The AS can
either encode or route the packets as part of its optimal scheme by utilizing the side
information (i.e., the output of the sender service instances running on nodes nodek+1, ..., nodeT).
It is easy to see that the relay R can use the same scheme as the AS to minimize
its number of transmissions, which shall be exactly the same as the number of link-level packet
transmissions by the AS on the link CHL in Figure B.1.
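The reduction above can be sketched as a small data-structure construction; `build_es_instance` and its field names below are illustrative only, mirroring the text: k sender nodes under S1 each emitting one packet, and m receiver nodes under S2 whose sender instances emit the side information and whose receiver instances demand the wanted packets.

```python
# Sketch of the B.3 reduction (illustrative names, not from the thesis):
# given an index coding instance (k packets, m clients with wants/side-info),
# build the network N of Figure B.1: switches S1, S2, aggregation switch AS,
# k sender nodes under S1 and m sender/receiver nodes under S2.

def build_es_instance(k, wants, side_info):
    """wants[j], side_info[j]: 0-based packet-index sets for client j."""
    m = len(wants)
    T = k + m
    nodes = [f"node{i + 1}" for i in range(T)]
    # topology: node1..nodek -> S1, node(k+1)..nodeT -> S2, S1/S2 -> AS
    edges = ([(n, "S1") for n in nodes[:k]]
             + [(n, "S2") for n in nodes[k:]]
             + [("S1", "AS"), ("S2", "AS")])
    senders = {nodes[i]: {"emits": {i}} for i in range(k)}       # node_i emits x_i
    receivers = {nodes[k + j]: {"emits": set(side_info[j]),      # S_o emits H_j^IC
                                "wants": set(wants[j])}          # S_r wants W_j^IC
                 for j in range(m)}
    return {"nodes": nodes, "edges": edges,
            "senders": senders, "receivers": receivers}

# index coding instance: k = 3 packets, m = 2 clients
inst = build_es_instance(3, wants=[{0}, {1}], side_info=[{1, 2}, {0, 2}])
assert len(inst["nodes"]) == 5                  # T = k + m
assert inst["receivers"]["node4"]["wants"] == {0}
```

Every packet wanted by a receiver on S2 is emitted under S1, so it must cross the AS, which is exactly why the AS's optimal scheme on CHL matches the relay's optimal index code.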
B.4 Proof of Theorem 24
Let ζ denote the number of sender service instances in the network (e.g., the number of
mappers in Hadoop MapReduce). Without loss of generality, we assume that each sender
service instance is required to communicate exactly one packet to its corresponding
receiver service instance (if it needs to transmit more than one packet, we replace it with
multiple sender service instances, each of which needs to communicate just one packet). We
note that d is the maximum number of hops between any sender service instance and its
corresponding receiver service instance. Hence, in a non-coding based solution a sender
service instance would require at most d link-level packet transmissions to communicate its packet
to the corresponding receiver service instance. As there are ζ sender service instances, a
non-coding based solution would require at most d·ζ link-level packet transmissions. The
coding based solution, on the other hand, requires at least ζ link-level packet transmissions
to the coding server, one from each of the sender service instances, before it can even
proceed with coding. Hence, µ ≥ ζ/(d·ζ) = 1/d.
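The counting argument reduces to simple arithmetic; the helper below (an illustrative sketch, not from the thesis) restates the bound µ ≥ ζ/(d·ζ) = 1/d numerically.

```python
# Numeric check of the Theorem 24 bound (illustrative numbers): with ζ sender
# service instances, each at most d hops from its receiver, an uncoded solution
# costs at most d·ζ link-level transmissions, while any coding solution costs
# at least ζ (one send per sender to the coding server), so µ ≥ 1/d.

def coding_gain_lower_bound(zeta, d):
    uncoded_max = d * zeta          # at most d hops per packet
    coded_min = zeta                # one transmission per sender to the coder
    return coded_min / uncoded_max

assert coding_gain_lower_bound(zeta=100, d=4) == 1 / 4
assert coding_gain_lower_bound(zeta=7, d=6) == 1 / 6
```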
B.5 Proof of Theorem 26
Directly follows from the fact that a coding based solution cannot require more
transmissions than a non-coding based solution. The tightness of the bound follows from the
observation that in all instances where the receiver service instances do not possess any
side information, a coding based solution cannot perform better than a non-coding based
solution, and then µ = 1.
B.6 Proof of Theorem 28
Without loss of generality we ignore all the sender and receiver service instances that
can communicate with each other without passing through the network’s bisection, as
such instances do not contribute to µ(bisection). We further assume, without loss of
generality, that each sender service instance is required to communicate exactly one packet
to its corresponding receiver service instance (if it needs to transmit more than one packet,
we replace it with multiple sender service instances, each of which needs to communicate
just one packet). Next, we note that a non-coding based solution would result in two link-level
packet transmissions crossing the network's bisection for each of the ζ sender service
instances, one from the sender towards the bisection switch and the other from the bisection switch
towards the receiver, resulting in a total of 2ζ link-level packet transmissions crossing the
network's bisection. The coding based solution first requires at least ζ link-level packet
transmissions crossing the network's bisection to the coding server, one from each of the
sender service instances. Hence, µ ≥ ζ/(2ζ) = 1/2.
Regarding the tightness of the bound, µ(bisection) → 1/2 for tree topologies where
β sender service instances are located on one side of the bisection switch (root), and β
receiver service instances are located on the other side of the bisection switch. Moreover,
each receiver service instance possesses the demands of all the other receiver service
instances as its side information. It is easy to see that in such a scenario a non-coding based
solution requires 2β link-level packet transmissions crossing the network's bisection. A
coding based solution requires β + 1 link-level packet transmissions crossing the network's
bisection, one from each of the β sender service instances to the coding server, and one
encoded transmission from the coding server to the receiver service instances. The encoded
link-level packet transmission consists of the bitwise XOR of all the demands of the receiver
service instances. Hence, in such a scenario, µ(bisection) = (β + 1)/(2β), and µ(bisection) → 1/2
for β ≫ 1.
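The tightness example can be checked numerically; `mu_bisection` below is an illustrative helper, not from the thesis, evaluating µ(bisection) = (β + 1)/(2β).

```python
# The B.6 tightness example, numerically: β senders and β receivers on opposite
# sides of the bisection switch, every receiver holding all other receivers'
# demands as side information. Uncoded cost: 2β crossings; coded cost: β + 1.

def mu_bisection(beta):
    return (beta + 1) / (2 * beta)

assert mu_bisection(1) == 1.0                    # no gain with a single pair
assert abs(mu_bisection(10**6) - 0.5) < 1e-5     # µ(bisection) -> 1/2 as β grows
```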
B.7 Proof of Theorem 29
Theorem 28 proves that if the coding is performed at the network bisection then
µ(bisection) ≥ 1/2. We prove this theorem by presenting an example where µ < 1/2 when the
middlebox performing the coding is not placed at the network bisection; specifically, in the
example presented in Section 4.4.2, µ = 1/4 when the coding is not performed at the
bisection.
Appendix C

Proof: Distributed Power Procurement for Data Centers
C.1 Proof of Lemma 33
Proof: Consider a set G_f consisting of f generating points, such that G_f ∈ L1 ∩ L2.
We show that G_f does not violate the transmission line constraints, and provides service
to exactly f load areas s_i, ∀i. First, note that since G_f ∈ L2, the f generating points
do not have transmission line conflicts. Second, note that as G_f ∈ L1, the associated
column vectors of these f generating points are linearly independent, implying that they
belong to f distinct policy sets. Thus, G_f provides service to at least f different load
areas, since each policy set represents the preferences of a distinct load area.
Therefore, a set G_f ∈ L1 ∩ L2, such that G_f has maximum cardinality among all sets
belonging to L1 ∩ L2, shall provide service to the maximum possible number of load areas
within the constraints. Furthermore, if no other set of maximum cardinality belonging to
L1 ∩ L2 has lower cost than G_f, then G_f provides an optimal solution to the 1-RCPP
problem. It has been shown that both TLM and SM are matroids; thus, a set belonging
to L1 ∩ L2 and having least cost and maximum cardinality among all the sets in L1 ∩ L2 can
be found in polynomial time using the weighted matroid intersection algorithm [171,178].
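As a toy illustration of Lemma 33 (hypothetical generating points and costs, with the SM matroid simplified to a partition matroid over policy sets for the sake of the example), the brute-force enumeration below finds the maximum-cardinality, least-cost set in L1 ∩ L2; the actual 1-RCPP algorithm obtains the same answer in polynomial time via weighted matroid intersection rather than enumeration.

```python
# Illustrative brute-force check of Lemma 33 (toy instance, hypothetical data):
# L1 is modeled as a partition matroid (at most one point per load-area policy
# set) and L2 as a partition matroid over transmission lines (at most one point
# per line). We enumerate common independent sets and pick maximum cardinality,
# then least cost.

from itertools import combinations

# generating point -> (policy set / load area, transmission line, cost)
points = {
    "g1": ("area1", "lineA", 5.0),
    "g2": ("area1", "lineB", 3.0),
    "g3": ("area2", "lineA", 4.0),
    "g4": ("area2", "lineC", 6.0),
    "g5": ("area3", "lineB", 2.0),
}

def independent(subset, key):
    """Partition-matroid independence: at most one point per class."""
    classes = [points[g][key] for g in subset]
    return len(classes) == len(set(classes))

def brute_force_rcpp():
    best = (0, float("inf"), frozenset())        # (cardinality, cost, set)
    names = sorted(points)
    for r in range(len(names) + 1):
        for sub in combinations(names, r):
            if independent(sub, 0) and independent(sub, 1):   # sub in L1 ∩ L2
                cost = sum(points[g][2] for g in sub)
                if (len(sub), -cost) > (best[0], -best[1]):   # more areas, then cheaper
                    best = (len(sub), cost, frozenset(sub))
    return best

card, cost, chosen = brute_force_rcpp()
assert card == 3                                 # three load areas can be served
assert chosen == frozenset({"g1", "g4", "g5"})   # distinct areas, distinct lines
assert cost == 13.0
```

Here {g1, g4, g5} is the unique size-3 common independent set: one point per load area with no two points sharing a transmission line, exactly the structure Lemma 33 exploits.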
C.2 Proof of Theorem 34
Proof: The correctness of the distributed Algorithm 1-RCPP follows from Lemma 33
and the correctness of the weighted matroid intersection algorithm by Edmonds [178]. Specifically,
the matroid intersection algorithm of [178] consists of four steps with corresponding
counterparts in Algorithm 1-RCPP.
1. In [178], Edmonds constructs a directed graph where edges represent valid replacements
of the set of generating points selected in Gon from the set of generating points
not yet selected. This corresponds to Steps 10, 11, 16, 17 in Algorithm 1-RCPP; in
our algorithm the entries of the routing tables →E and ←E indicate the valid
replacements.
2. Similar to [178], Algorithm 1-RCPP finds a vector G1 such that each generating point
in G1 corresponds to a valid addition of generating points to the already selected
generating points (Gon) with respect to L1 of the SM matroid. Similarly, we find a
vector G2 that corresponds to the generating points that are a valid addition with
respect to L2 of the TLM matroid. Steps 19 and 20 do the same.
3. Analogous to [178], Algorithm 1-RCPP finds a least-cost path; this corresponds to
Phase 2 of our algorithm.
4. Step 32 of Algorithm 1-RCPP computes an updated vector Gon by the necessary
replacements and additions, similar to [178].
Therefore the correctness of Algorithm 1-RCPP follows directly from the correctness of
Edmonds' matroid intersection algorithm.
Bibliography
[1] SINTEF. (22 May 2013) Big data, for better or worse: 90% of world’s data
generated over last two years. ScienceDaily. [Online]. Available: www.sciencedaily.
com/releases/2013/05/130522085217.htm
[2] B. Marr, “Why only one of the 5 Vs of big data really matters,” IBM Big Data
Analytics & Hub, 2015.
[3] The four V’s of big data. IBM Big Data Analytics & Hub. [Online]. Available:
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
[4] G. Cook and J. Van Horn, “How dirty is your data? a look at the energy choices
that power cloud computing,” Greenpeace International, The Netherlands, Tech.
Rep., April 2011.
[5] A. S. Masanet, Eric and J. Koomey, “Characteristics of low-carbon data centers,”
Nature Climate Change, vol. 3, no. 7, pp. 627–630, 2013.
[6] J. Whitney and P. Delforge, “Data center efficiency assessment,” NRDC and An-
thesis, USA, Tech. Rep. IP:14-08-A, August 2014.
[7] J. G. Koomey, “Worldwide electricity used in data centers,” Environmental Re-
search Letters, vol. 3, no. 3, p. 034008, 2008.
[8] (2014) Industry outlook: Data center energy efficiency. The Data
Center Journal. [Online]. Available: http://www.datacenterjournal.com/it/
industry-outlook-data-center-energy-efficiency/
[9] C. Preimesberger. (2013) Intel’s low-power atom chip making way into data
centers. eWEEK. [Online]. Available: http://www.datacenterjournal.com/it/
industry-outlook-data-center-energy-efficiency/
[10] J. Koomey, “Growth in data center electricity use 2005 to 2010,” Oakland, CA:
Analytical Press, 2011.
[11] J. Glanz. (2012) Power, pollution and the internet. The New York
Times. [Online]. Available: http://www.nytimes.com/2012/09/23/technology/
data-centers-waste-vast-amounts-of-energy-belying-industry-image.html
[12] G. Ghatikar, V. Ganti, N. Matson, and M. Piette, “Demand response opportunities
and enabling technologies for data centers: Findings from field studies,” Lawrence
Berkeley National Laboratory, Tech. Rep. LBNL-5763E, 2014.
[13] R. Brown, E. Masanet, B. Nordman, W. Tschudi, A. Shehabi, J. Stanley,
J. Koomey, D. Sartor, and P. Chan, “Report to congress on server and data center
energy efficiency: Public law 109-431,” Lawrence Berkeley National Laboratory,
Tech. Rep. LBNL-363E, 2008.
[14] G. Sauls. Measurement of data centre power con-
sumption. Falcon Electronics Pty Ltd. [Online]. Avail-
able: https://learningnetwork.cisco.com/servlet/JiveServlet/downloadBody/
3736-102-1-10478/measurement20of20data20centre20power20consumption.pdf
[15] M. Stansberry, “The global data center industry survey,” UptimeInstitute, USA,
Tech. Rep., October 2013.
[16] P. Kogge, “The tops in flops,” Spectrum, IEEE, vol. 48, no. 2, pp. 48–54, February
2011.
[17] [Online]. Available: https://www.congress.gov/bill/113th-congress/house-bill/540
[18] “Clicking clean: How companies are creating the green internet,” Greenpeace In-
ternational, Tech. Rep., April 2014.
[19] “Clicking clean: A guide to building the green internet,” Greenpeace International,
Tech. Rep., May 2015.
[20] Z. Miners. Greenpeace fingers Youtube, Netflix as threat to greener internet.
PCWorld. [Online]. Available: http://www.pcworld.idg.com.au/article/574853/
greenpeace-fingers-youtube-netflix-threat-greener-internet/
[21] Apache Hadoop. [Online]. Available: http://hadoop.apache.org/
[22] Amazon web services. [Online]. Available: http://aws.amazon.com/
[23] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clus-
ters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[24] J. Leverich and C. Kozyrakis, “On the energy (in) efficiency of hadoop clusters,”
ACM SIGOPS Operating Systems Review, vol. 44, no. 1, pp. 61–65, 2010.
[25] N. Zhu, L. Rao, X. Liu, J. Liu, and H. Guan, “Taming power peaks in mapreduce
clusters,” in ACM SIGCOMM Computer Communication Review, vol. 41, no. 4.
ACM, 2011, pp. 416–417.
[26] D. Abts, M. Marty, P. Wells, P. Klausler, and H. Liu, “Energy proportional data-
center networks,” in ACM SIGARCH Computer Architecture News, vol. 38, no. 3.
ACM, 2010, pp. 338–347.
[27] M. Gupta and S. Singh, “Using low-power modes for energy conservation in ethernet
lans.” in INFOCOM, vol. 7, 2007, pp. 2451–55.
[28] S. Nedevschi, L. Popa, G. Iannaccone, S. Ratnasamy, and D. Wetherall, “Reducing
network energy consumption via sleeping and rate-adaptation.” in NSDI, vol. 8,
2008, pp. 323–336.
[29] C. Gunaratne, K. Christensen, B. Nordman, and S. Suen, “Reducing the energy
consumption of ethernet with adaptive link rate (alr),” IEEE Transactions on Com-
puters, vol. 57, no. 4, 2008.
[30] A. Carrega, S. Singh, R. Bruschi, and R. Bolla, “Traffic merging for energy-efficient
datacenter networks,” in International Symposium on Performance Evaluation of
Computer and Telecommunication Systems. IEEE, 2012, pp. 1–5.
[31] R. Shields, “Cultural topology: The seven bridges of konigsburg, 1736,” Theory,
Culture & Society, vol. 29, no. 4-5, pp. 43–57, 2012.
[32] L. Euler, The seven bridges of Konigsberg. Wm. Benton, 1956.
[33] S. C. Carlson. Graph theory. Encyclopedia Britannica. [Online]. Available:
http://www.britannica.com/topic/graph-theory
[34] T. H. Cormen, Introduction to algorithms. MIT press, 2009.
[35] A. Paz and S. Moran, Non-deterministic polynomial optimization problems and
their approximation. Springer, 1977.
[36] H. Whitney, “On the abstract properties of linear dependence,” American Journal
of Mathematics, vol. 57, no. 3, pp. 509–533, 1935.
[37] J. Oxley, Matroid theory. Oxford University Press, USA, 2006.
[38] A. Schrijver, Combinatorial Optimization: Polyhedra and Efficiency. Springer
Verlag, 2003.
[39] [Online]. Available: http://www.prnewswire.com/news-releases/altiors-altrastar---hadoop-storage-accelerator-and-optimizer-now-certified-on-cdh4-clouderas-distribution-including-apache-hadoop-version-4-183906141.html
[40] [Online]. Available: https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
[41] T. White, Hadoop: The definitive guide. USA: O'Reilly Media, Inc., 2012.
[42] P. Costa, A. Donnelly, A. Rowstron, and G. O'Shea, “Camdoop: Exploiting in-
network aggregation for big data applications,” in USENIX NSDI, vol. 12, 2012,
pp. 29–42.
[43] L. Mai, L. Rupprecht, A. Alim, P. Costa, M. Migliavacca, P. Pietzuch, and A. L.
Wolf, “Netagg: Using middleboxes for application-specific on-path aggregation in
data centres,” in Proceedings of the 10th ACM International on Conference on
emerging Networking Experiments and Technologies. ACM, 2014, pp. 249–262.
[44] Y. Yu, P. Kumar, and M. Isard, “Distributed aggregation for data parallel com-
puting,” in SOSP, vol. 9, 2009, pp. 11–14.
[45] Z. Asad and M. A. R. Chaudhry, “A two-way street: Green big data processing for
a greener smart grid,” IEEE Systems Journal, vol. PP, no. 99, pp. 1–11, 2016.
[46] Cisco, “Power management in the Cisco unified computing system: An integrated
approach,” White Paper, C11-627731-01, Cisco, USA, February 2011.
[47] L. Barroso, J. Clidaras, and U. Holzle, The datacenter as a computer: an intro-
duction to the design of warehouse-scale machines. USA: Morgan & Claypool
Publishers, 2013.
[48] D. Chernicoff, The shortcut guide to data center energy efficiency. Realtimepublishers.com, 2009.
[49] R. Talaber, T. Brey, and L. Lamers, “Using virtualization to improve data center
efficiency,” White Paper, 19, The Green Grid, USA.
[50] Y. Shang, D. Li, and M. Xu, “Energy-aware routing in data center network,” in
ACM SIGCOMM workshop on Green networking, 2010, pp. 1–8.
[51] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee,
and N. McKeown, “Elastictree: Saving energy in data center networks,” in NSDI,
vol. 10, 2010, pp. 249–264.
[52] A. Greenberg, J. Hamilton, D. Maltz, and P. Patel, “The cost of a cloud: research
problems in data center networks,” ACM SIGCOMM computer communication
review, vol. 39, no. 1, pp. 68–73, 2008.
[53] Z. Asad, M. A. R. Chaudhry, and D. Malone, “Greener data exchange in the cloud:
A coding-based optimization for big data processing,” IEEE Journal on Selected
Areas in Communications, vol. 34, no. 5, pp. 1360–1377, May 2016.
[54] ——, “Codhoop: A system for optimizing big data processing,” in 2015 Annual
IEEE Systems Conference (SysCon) Proceedings, April 2015, pp. 295–300.
[55] ——, “A coding based optimization for hadoop,” in Extended Abstract and poster
ACM/IEEE International Conference for High Performance Computing, Network-
ing, Storage and Analysis (SC), 2015.
[56] “Cisco global cloud index: Forecast and methodology, 2012–2017,” 2013, White
Paper. [Online]. Available: http://www.cisco.com/c/en/us/solutions/collateral/
service-provider/global-cloud-index-gci/Cloud Index White Paper.pdf/
[57] M. Chowdhury, M. Zaharia, J. Ma, M. Jordan, and I. Stoica, “Managing data trans-
fers in computer clusters with orchestra,” SIGCOMM-Computer Communication
Review, vol. 41, no. 4, pp. 98–109, 2011.
[58] C. Gunaratne, K. Christensen, and B. Nordman, “Managing energy consumption
costs in desktop PCs and LAN switches with proxying, split TCP connections,
and scaling of link speed,” International Journal of Network Management, vol. 15,
no. 5, pp. 297–310, 2005.
[59] [Online]. Available: https://polastre.com/2010/03/establishing-performance-metrics-for-the-data-centre/
[60] A. Greenberg, J. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz,
P. Patel, and S. Sengupta, “Vl2: a scalable and flexible data center network,” in
ACM SIGCOMM Computer Communication Review, vol. 39, no. 4. ACM, 2009,
pp. 51–62.
[61] A. Greenberg, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta, “Towards a next
generation data center architecture: scalability and commoditization,” in Workshop
on Programmable routers for extensible services of tomorrow. ACM, 2008.
[62] Apache storm. [Online]. Available: http://storm-project.net/
[63] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-
parallel programs from sequential building blocks,” in ACM SIGOPS Operating
Systems Review, 2007, pp. 59–72.
[64] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. Gunda, and J. Cur-
rey, “Dryadlinq: A system for general-purpose distributed data-parallel computing
using a high-level language,” in USENIX OSDI, 2008, pp. 1–14.
[65] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica,
“Dominant resource fairness: fair allocation of multiple resource types,” in USENIX
NSDI, 2011, pp. 323–336.
[66] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg,
“Quincy: fair scheduling for distributed computing clusters,” in ACM SIGOPS.
ACM, 2009, pp. 261–276.
[67] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica,
“Delay scheduling: a simple technique for achieving locality and fairness in cluster
scheduling,” in European conference on Computer systems. ACM, 2010, pp. 265–
278.
[68] A. Das, C. Lumezanu, Y. Zhang, V. Singh, G. Jiang, and C. Yu, “Transparent and
flexible network management for big data processing in the cloud,” in USENIX
Workshop on Hot Topics in Cloud Computings. USENIX, 2013.
[69] A. Shieh, S. Kandula, A. Greenberg, C. Kim, and B. Saha, “Sharing the data center
network,” in USENIX NSDI, pp. 309–322.
[70] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, “Hedera:
Dynamic flow scheduling for data center networks,” in NSDI, vol. 10, 2010, pp.
19–19.
[71] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fain-
man, G. Papen, and A. Vahdat, “Helios: a hybrid electrical/optical switch archi-
tecture for modular data centers,” ACM SIGCOMM Computer Communication
Review, vol. 41, no. 4, pp. 339–350, 2011.
[72] G. Wang, T. E. Ng, and A. Shaikh, “Programming your network at run-time for
big data applications,” in HotSDN. ACM, 2012, pp. 103–108.
[73] M. Chaudhry, Z. Asad, A. Sprintson, and M. Langberg, “On the Complementary
Index Coding Problem,” in IEEE ISIT 2011.
[74] ——, “Finding sparse solutions for the index coding problem,” in IEEE GLOBE-
COM 2011.
[75] A. Curtis, K. Wonho, and P. Yalagandula, “Mahout: Low-overhead datacenter
traffic management using end-host-based elephant detection,” in IEEE INFOCOM
2011.
[76] D. Perino, M. Varvello, and K. Puttaswamy, “ICN-RE: redundancy elimination
for information-centric networking,” in ACM SIGCOMM ICN workshop, 2012, pp.
91–96.
[77] Cisco WAAS 4.4.1 context-aware DRE, the adaptive cache architecture. Cisco.
[Online]. Available: http://www.cisco.com/c/en/us/products/collateral/routers/
wide-area-application-services-waas-software/white_paper_c11-676350.html
[78] Steelhead. Riverbed. [Online]. Available: http://www.riverbed.com/products/
wan-optimization/
[79] M. V. Neves, C. A. De Rose, K. Katrinis, and H. Franke, “Pythia: Faster big data
in motion through predictive software-defined network optimization at runtime,”
in IEEE IPDPS, 2014, pp. 82–90.
[80] A. Das, C. Lumezanu, Y. Zhang, V. Singh, G. Jiang, and C. Yu, “Transparent and
flexible network management for big data processing in the cloud,” in USENIX
HotCloud, 2013.
[81] H. Xu and B. Li, “Tinyflow: Breaking elephants down into mice in data center
networks,” in Local & Metropolitan Area Networks (LANMAN), 2014 IEEE 20th
International Workshop on. IEEE, 2014, pp. 1–6.
[82] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu,
“Bcube: a high performance, server-centric network architecture for modular data
centers,” ACM SIGCOMM Computer Communication Review, vol. 39, no. 4, 2009.
[83] M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center net-
work architecture,” in ACM SIGCOMM Computer Communication Review, vol. 38,
no. 4. ACM, 2008, pp. 63–74.
[84] F. A. Silva, A. Boukerche, T. R. Silva, L. B. Ruiz, E. Cerqueira, and A. A. Loureiro,
“Vehicular networks: A new challenge for content-delivery-based applications,”
ACM Computing Surveys (CSUR), vol. 49, no. 1, p. 11, 2016.
[85] M. Li, Z. Yang, and W. Lou, “Codeon: Cooperative popular content distribution for
vehicular networks using symbol level network coding,” IEEE Journal on Selected
Areas in Communications, vol. 29, no. 1, pp. 223–235, January 2011.
[86] B. Hassanabadi and S. Valaee, “Reliable periodic safety message broadcasting in
vanets using network coding,” IEEE Transactions on Wireless Communications,
vol. 13, no. 3, pp. 1284–1297, March 2014.
[87] A. Khlass, T. Bonald, and S. E. Elayoubi, “Performance evaluation of intra-site
coordination schemes in cellular networks,” Performance Evaluation, vol. 98, pp.
1–18, 2016.
[88] S. Katti, H. Rahul, W. Hu, D. Katabi, M. Medard, and J. Crowcroft, “Xors in the
air: practical wireless network coding,” in ACM SIGCOMM computer communi-
cation review, vol. 36, no. 4, 2006, pp. 243–254.
[89] I.-H. Hou, Y.-E. Tsai, T. F. Abdelzaher, and I. Gupta, “Adapcode: Adaptive
network coding for code updates in wireless sensor networks,” in INFOCOM 2008.
The 27th Conference on Computer Communications. IEEE. IEEE, 2008.
[90] J. Li and B. Li, “Cooperative repair with minimum-storage regenerating codes
for distributed storage,” in IEEE INFOCOM 2014-IEEE Conference on Computer
Communications. IEEE, 2014, pp. 316–324.
[91] Y. Wu, “Existence and construction of capacity-achieving network codes for dis-
tributed storage,” in IEEE International Symposium on Information Theory, 2009,
pp. 1150–1154.
[92] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. Dimakis, R. Vadali, S. Chen,
and D. Borthakur, “Xoring elephants: Novel erasure codes for big data,” Proc.
VLDB Endow., vol. 6, no. 5, Mar. 2013.
[93] R. Ahlswede, N. Cai, S. Li, and R. Yeung, “Network information flow,” IEEE
Transactions on Information Theory, vol. 46, no. 4, 2000.
[94] Y. Birk and T. Kol, “Informed-Source Coding-on-Demand (ISCOD) over Broadcast
Channels,” in Proceedings of INFOCOM’98, 1998.
[95] Z. Bar-Yossef, Y. Birk, T. S. Jayram, and T. Kol, “Index Coding with Side In-
formation,” In Proceedings of 47th Annual IEEE Symposium on Foundations of
Computer Science(FOCS), pp. 197–206, 2006.
[96] S. Katti, H. Rahul, W. Hu, D. Katabi, M. Medard, and J. Crowcroft, “XORs in
the Air: Practical Wireless Network Coding,” in Proceedings of SIGCOMM ’06,
2006, pp. 243–254.
[97] M. Chaudhry and A. Sprintson, “Efficient algorithms for index coding,” in IEEE
INFOCOM Workshops 2008.
[98] S. Sahoo, D. Nikovski, T. Muso, and K. Tsuru, “Electricity theft detection using
smart meter data,” in IEEE ISGT, 2015, pp. 1–5.
[99] Irish social science data archive. ISSDA. [Online]. Available: http://www.ucd.ie/
issda/
[100] Apache storm concepts. [Online]. Available: http://storm.apache.org/about/
simple-api.html
[101] Apache Cassandra. [Online]. Available: http://cassandra.apache.org/
[102] R. Lapuh, Y. Zhao, W. Tawbi, J. M. Regan, and D. Head, “System, device, and
method for improving communication network reliability using trunk splitting,”
Feb. 6, 2007, US Patent 7,173,934.
[103] [Online]. Available: http://www.cisco.com/c/en/us/td/docs/solutions/enterprise/data center/dc infra2 5/dcinfra 4.pdf
[104] “Doctors will routinely use your dna to keep you well,” IBM Research, Tech. Rep.,
2013.
[105] [Online]. Available: https://storm.incubator.apache.org/documentation/concepts.html
[106] R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung, “Network Information Flow,”
IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1204–1216, 2000.
[107] H. Maleki, V. R. Cadambe, and S. Jafar, “Index coding: An interference alignment
perspective,” IEEE Transactions on Information Theory, vol. 60, no. 9, pp. 5402–
5432, 2014.
[108] M. Effros, S. El Rouayheb, and M. Langberg, “An equivalence between network
coding and index coding,” IEEE Transactions on Information Theory, vol. 61, no. 3, 2015.
[109] S. El Rouayheb, A. Sprintson, and C. Georghiades, “On the index coding problem
and its relation to network coding and matroid theory,” IEEE Transactions on
Information Theory, vol. 56, no. 7, pp. 3187–3195, 2010.
[110] R. Koetter and M. Medard, “An algebraic approach to network coding,”
IEEE/ACM Transactions on Networking (TON), vol. 11, no. 5, pp. 782–795, 2003.
[111] B. Hassanabadi, L. Zhang, and S. Valaee, “Index coded repetition-based MAC in
vehicular ad-hoc networks,” in Proceedings of the 6th IEEE Conference on Consumer
Communications and Networking Conference. IEEE, 2009, pp. 1180–1185.
[112] S. Sorour and S. Valaee, “Adaptive network coded retransmission scheme for wire-
less multicast,” in Proceedings of the 2009 IEEE International Symposium on
Information Theory. IEEE, 2009, pp. 2577–2581.
[113] N. Chiba and T. Nishizeki, “Arboricity and subgraph listing algorithms,” SIAM
Journal on Computing, vol. 14, no. 1, pp. 210–223, 1985.
[114] Sequence file reader. [Online]. Available: https://hadoop.apache.org/docs/r1.0.4/
api/org/apache/hadoop/io/package-summary.html
[115] Hortonworks Data Platform System Administration Guides. USA: Hortonworks,
2015.
[116] J. Norris. (2013) Package org.apache.hadoop.examples.terasort. Apache Hadoop.
[Online]. Available: https://hadoop.apache.org/docs/current/api/org/apache/
hadoop/examples/terasort/package-summary.html
[117] Cisco, “Big data in the enterprise - network design considerations,” White Paper,
USA.
[118] W. Yu, Y. Wang, and X. Que, “Design and evaluation of network-levitated merge
for hadoop acceleration,” IEEE Transactions on Parallel and Distributed Systems,
vol. 25, no. 3, pp. 602–611, 2014.
[119] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. Vijaykumar, “Shuffle-
watcher: Shuffle-aware scheduling in multi-tenant mapreduce clusters,” in Annual
Technical Conference. USENIX, 2014, pp. 1–12.
[120] F. Ahmad, S. Lee, M. Thottethodi, and T. Vijaykumar, “Mapreduce with commu-
nication overlap (marco),” Journal of Parallel and Distributed Computing, vol. 73,
no. 5, pp. 608–620, 2013.
[121] redhat. [Online]. Available: http://www.redhat.com/
[122] J. W. Eaton, D. Bateman, and S. Hauberg, GNU Octave version 3.0.1 manual:
a high-level interactive language for numerical computations. CreateSpace
Independent Publishing Platform, 2009, ISBN 1441413006. [Online]. Available:
http://www.gnu.org/software/octave/doc/interpreter
[123] CentOS. [Online]. Available: https://www.centos.org/
[124] Citrix. [Online]. Available: http://www.citrix.com/products/xenserver/
[125] XenServer. [Online]. Available: http://xenserver.org/partners/
developing-products-for-xenserver.html
[126] Open vSwitch. [Online]. Available: http://openvswitch.org/
[127] P. Mahadevan, P. Sharma, S. Banerjee, and P. Ranganathan, “A power bench-
marking framework for network devices,” in Networking 2009. Springer, 2009.
[128] B. Zhang, K. Sabhanatarajan, A. Gordon-Ross, and A. George, “Real-time perfor-
mance analysis of adaptive link rate,” in Local Computer Networks. IEEE, 2008,
pp. 282–288.
[129] C. Gunaratne and K. Christensen, “Ethernet adaptive link rate: System design and performance evaluation,” in Proceedings of the 31st IEEE Conference on Local Computer Networks. IEEE, 2006, pp. 28–35.
[130] Z. Asad, M. A. R. Chaudhry, and D. Kundur, “On the use of matroid theory for
distributed cyber-physical-constrained generator scheduling in smart grid,” IEEE
Transactions on Industrial Electronics, vol. 62, no. 1, pp. 299–309, Jan 2015.
[131] L. Bishop, “Energy procurement: The new data center power model,” The Data Center Journal, 2014.
[132] M.-A. Wolf, “Environmentally sound data centres: Policy measures, metrics, and
methodologies,” Tech. Rep. 04.07.2014, April, 2014.
[133] N. Darukhanawalla, “Geographically dispersed data centers,” Tech. Rep., Sep. 2009.
[134] Z. Liu, M. Lin, A. Wierman, S. H. Low, and L. L. Andrew, “Geographical load
balancing with renewables,” ACM SIGMETRICS Performance Evaluation Review,
vol. 39, no. 3, pp. 62–66, 2011.
[135] A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, and B. Maggs, “Cutting the
electric bill for internet-scale systems,” in ACM SIGCOMM computer communica-
tion review, vol. 39, no. 4. ACM, 2009, pp. 123–134.
[136] State Energy Conservation Office. [Online]. Available: http://www.seco.cpa.state.tx.us/
[137] R. Bianchini, “Leveraging renewable energy in data centers: present and future,”
in Proceedings of the 21st international symposium on High-Performance Parallel
and Distributed Computing. ACM, 2012, pp. 135–136.
[138] K.-K. Nguyen, M. Cheriet, M. Lemay, M. Savoie, and B. Ho, “Powering a data
center network via renewable energy: A green testbed,” IEEE Internet Computing,
vol. 17, no. 1, pp. 40–49, 2013.
[139] B. Aksanli, T. Rosing, and E. Pettis, “Distributed battery control for peak
power shaving in datacenters,” in 2013 International Green Computing Confer-
ence (IGCC). IEEE, 2013, pp. 1–8.
[140] M. Ghamkhari, H. Mohsenian-Rad, and A. Wierman, “Optimal risk-aware power
procurement for data centers in day-ahead and real-time electricity markets,” in
IEEE Conference on Computer Communications Workshops. IEEE, 2014, pp.
610–615.
[141] H. Xu and B. Li, “Reducing electricity demand charge for data centers with partial
execution,” in International conference on Future energy systems. ACM, 2014,
pp. 51–61.
[142] Z. Liu, A. Wierman, Y. Chen, B. Razon, and N. Chen, “Data center demand
response: Avoiding the coincident peak via workload shifting and local generation,”
SIGMETRICS Performance Evaluation Review, vol. 41, no. 1, pp. 341–342, 2013.
[143] Z. Liu, I. Liu, S. Low, and A. Wierman, “Pricing data center demand response,” in
ACM SIGMETRICS Performance Evaluation Review, vol. 42, no. 1. ACM, 2014,
pp. 111–123.
[144] S. Steinke, N. Grunwald, L. Wehmeyer, R. Banakar, M. Balakrishnan, and P. Mar-
wedel, “Reducing energy consumption by dynamic copying of instructions onto
onchip memory,” in International Symposium on System Synthesis. IEEE, 2002,
pp. 213–218.
[145] E. Pinheiro, R. Bianchini, E. V. Carrera, and T. Heath, “Dynamic cluster recon-
figuration for power and performance,” in Compilers and operating systems for low
power. Springer, 2003, pp. 75–93.
[146] F. Pilo, G. Pisano, and G. Soma, “Optimal coordination of energy resources with a
two-stage online active management,” IEEE Trans. Ind. Electron., vol. 58, no. 10,
pp. 4526–4537, 2011.
[147] A. Timbus, M. Larsson, and C. Yuen, “Active management of distributed energy
resources using standardized communications and modern information technolo-
gies,” IEEE Trans. Ind. Electron., vol. 56, no. 10, pp. 4029–4037, 2009.
[148] P. C. Loh, L. Zhang, and F. Gao, “Compact integrated energy systems for dis-
tributed generation,” IEEE Trans. Ind. Electron., vol. 60, no. 4, pp. 1492–1502,
April 2013.
[149] E. Pouresmaeil, C. Miguel-Espinar, M. Massot-Campos, D. Montesinos-Miracle,
and O. Gomis-Bellmunt, “A control technique for integration of dg units to the
electrical networks,” IEEE Trans. Ind. Electron., vol. 60, no. 7, pp. 2881–2893,
July 2013.
[150] A. Pantoja and N. Quijano, “A population dynamics approach for the dispatch of
distributed generators,” IEEE Trans. Ind. Electron., vol. 58, no. 10, pp. 4559–4567,
2011.
[151] D. Q. Hung and N. Mithulananthan, “Multiple distributed generator placement
in primary distribution networks for loss reduction,” IEEE Trans. Ind. Electron.,
vol. 60, no. 4, pp. 1700–1708, April 2013.
[152] S. Paudyal, C. Canizares, and K. Bhattacharya, “Optimal operation of distribution
feeders in smart grids,” IEEE Trans. Ind. Electron., vol. 58, no. 10, pp. 4495–4503,
2011.
[153] S. Bruno, S. Lamonaca, G. Rotondo, U. Stecchi, and M. La Scala, “Unbalanced
three-phase optimal power flow for smart grids,” IEEE Trans. Ind. Electron.,
vol. 58, no. 10, pp. 4504–4513, 2011.
[154] S. Chen and H. Gooi, “Jump and shift method for multi-objective optimization,”
IEEE Trans. Ind. Electron., vol. 58, no. 10, pp. 4538–4548, 2011.
[155] A. Bhardwaj, N. S. Tung, and V. Kamboj, “Unit commitment in power system: A
review,” Int. J. Elect. Power Engg., vol. 6, no. 1, pp. 51–57, 2012.
[156] H. Happ, “Optimal power dispatch: A comprehensive survey,” IEEE Trans. on
Power App. and Sys., vol. 96, no. 3, pp. 841–854, 1977.
[157] D. Kirschen and G. Strbac, Fundamentals of Power System Economics. John Wiley & Sons Ltd, 2004.
[158] J. Lopez, J. Meza, I. Guillen, and R. Gomez, “A heuristic algorithm to solve the
unit commitment problem for real-life large-scale power systems,” Int. J. Elect.
Power Energy Syst., vol. 49, pp. 287–295, 2013.
[159] S. Ng and J. Zhong, “Security-constrained dispatch with controllable loads for
integrating stochastic wind energy,” in IEEE PES ISGT 2012.
[160] K. Zare and S. Hashemi, “A solution to transmission-constrained unit commitment
using hunting search algorithm,” in IEEE EEEIC, 2012, pp. 941–946.
[161] R. Bacher, “Power system models, objectives and constraints in optimal power
flow calculations,” in Optimization in Planning and Operation of Electric Power
Systems. Springer, 1993, pp. 217–263.
[162] F. Capitanescu, W. Rosehart, and L. Wehenkel, “Optimal power flow computations
with constraints limiting the number of control actions,” in IEEE PowerTech, 2009,
pp. 1–8.
[163] E. N. Azadani, S. H. Hosseinian, P. H. Divshali, and B. Vahidi, “Stability con-
strained optimal power flow in deregulated power systems,” Elec. Power Compo-
nents and Systems, vol. 39, no. 8, pp. 713–732, 2011.
[164] J. J. Shaw, “A direct method for security-constrained unit commitment,” IEEE
Trans. Power Sys., vol. 10, no. 3, pp. 1329–1342, 1995.
[165] A. Chaouachi, R. Kamel, R. Andoulsi, and K. Nagasaka, “Multiobjective intelligent
energy management for a microgrid,” IEEE Trans. Ind. Electron., vol. 60, no. 4,
pp. 1688–1699, April 2013.
[166] R. M. Karp, Reducibility among Combinatorial Problems. Springer, 1972.
[167] R. Kannan and C. Monma, “On the computational complexity of integer
programming problems,” in Optimization and Operations Research, ser. Lecture
Notes in Economics and Mathematical Systems, R. Henn, B. Korte, and W. Oettli,
Eds. Springer Berlin Heidelberg, 1978, vol. 157, pp. 161–172. [Online]. Available:
http://dx.doi.org/10.1007/978-3-642-95322-4_17
[168] S. S. Skiena, The Algorithm Design Manual, 2nd ed. Springer Publishing Company,
Incorporated, 2008.
[169] R. Impagliazzo and R. Paturi, “Complexity of k-SAT,” in Proceedings of the Fourteenth Annual IEEE Conference on Computational Complexity. IEEE, 1999, pp. 237–240.
[170] R. Impagliazzo, S. Lovett, R. Paturi, and S. Schneider, “0-1 integer linear pro-
gramming with a linear number of constraints,” arXiv preprint arXiv:1401.5512,
2014.
[171] E. L. Lawler, “Matroid intersection algorithms,” Mathematical Programming, vol. 9, 1975.
[172] C. H. Papadimitriou, Computational Complexity. Reading, Massachusetts: Addison-Wesley, 1994.
[173] B. Krishnamoorthy, “Bounds on the size of branch-and-bound proofs for integer
knapsacks,” Operations Research Letters, vol. 36, no. 1, pp. 19–25, 2008.
[174] S. Bu, F. R. Yu, P. X. Liu, and P. Zhang, “Distributed scheduling in smart grid
communications with dynamic power demands and intermittent renewable energy
resources,” in IEEE Int. Conf. on Comm. Workshops. IEEE, 2011, pp. 1–5.
[175] C. Gong, X. Wang, W. Xu, and A. Tajer, “Distributed real-time energy scheduling
in smart grid: Stochastic model and fast optimization,” IEEE Trans. Smart Grid,
vol. 4, no. 3, pp. 1476–1489, 2013.
[176] R. E. Bixby, W. Cook, A. Cox, and E. K. Lee, “Parallel mixed integer program-
ming,” Rice University Center for Research on Parallel Computation Research
Monograph CRPC-TR95554, 1995.
[177] J. T. Linderoth and M. W. P. Savelsbergh, “A computational study of search
strategies for mixed integer programming,” INFORMS J. on Computing, vol. 11,
pp. 173–187, 1997.
[178] J. Edmonds, “Matroid intersection,” Annals of Discrete Mathematics, vol. 4, pp. 39–49, 1979.
[179] S. Haldar, “An All Pairs Paths Distributed Algorithm using 2n² Messages,” in
Graph-Theoretic Concepts in Computer Science. Springer, 1994, pp. 350–363.
[180] H. Hamacher, “A Time Expanded Matroid Algorithm for Finding Optimal Dy-
namic Matroid Intersections,” Mathematical Methods of Operations Research,
vol. 29, no. 5, pp. 203–215, 1985.
[181] D. Zuckerman, “Linear degree extractors and the inapproximability of max clique
and chromatic number,” in Proc. of ACM Symposium on Theory of Computing,
2006.
[182] IEEE Power Systems Test Case Archive. [Online]. Available: http://www.ee.
washington.edu/research/pstca/index.html/
[183] A. Diniz, “Test cases for unit commitment and hydrothermal scheduling problems,”
in Proc. IEEE Power Eng. Soc. Gen. Meeting, 2010.
[184] A. Singla, C. Hong, L. Popa, and P. Godfrey, “Jellyfish: Networking data centers
randomly,” in NSDI. USENIX, 2012, pp. 225–238.
[185] IEEE P802.3az Energy Efficient Ethernet Task Force. [Online]. Available: www.ieee802.org/3/az
[186] Solar-powered data centers. Data Center Knowledge. [Online]. Available:
http://www.datacenterknowledge.com/solar-powered-data-centers/
[187] V. Valancius, N. Laoutaris, L. Massoulie, C. Diot, and P. Rodriguez, “Greening
the internet with nano data centers,” in International conference on Emerging
networking experiments and technologies. ACM, 2009, pp. 37–48.
[188] B. Khargharia, S. Hariri, F. Szidarovszky, M. Houri, H. El-Rewini, S. Khan, I. Ah-
mad, and M. Yousif, “Autonomic power & performance management for large-scale
data centers,” in IEEE IPDPS. IEEE, 2007.
[189] L. Mastroleon, N. Bambos, C. Kozyrakis, and D. Economou, “Automatic power management schemes for internet servers and data centers,” in IEEE GLOBECOM, vol. 2. IEEE, 2005.
[190] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu, “No power
struggles: Coordinated multi-level power management for the data center,” in ACM
SIGARCH Computer Architecture News, vol. 36, no. 1. ACM, 2008, pp. 48–59.
[191] A. Gandhi, M. Harchol-Balter, R. Das, and C. Lefurgy, “Optimal power allocation in server farms,” in ACM SIGMETRICS Performance Evaluation Review, vol. 37, no. 1. ACM, 2009, pp. 157–168.
[192] C. Hsu and W. Feng, “A power-aware run-time system for high-performance com-
puting,” in ACM/IEEE conference on Supercomputing, 2005, p. 1.
[193] Q. Deng, D. Meisner, A. Bhattacharjee, T. Wenisch, and R. Bianchini, “MultiScale: Memory system DVFS with multiple memory controllers,” in International Symposium on Low Power Electronics and Design. ACM, 2012, pp. 297–302.
[194] L. Rao, X. Liu, L. Xie, and W. Liu, “Minimizing electricity cost: optimization
of distributed internet data centers in a multi-electricity-market environment,” in
Proceedings of IEEE INFOCOM. IEEE, 2010, pp. 1–9.
[195] D. Meisner, B. Gold, and T. Wenisch, “PowerNap: Eliminating server idle power,”
ACM SIGARCH Computer Architecture News, vol. 37, no. 1, pp. 205–216, 2009.
[196] L. Liu, H. Wang, X. Liu, X. Jin, W. He, Q. Wang, and Y. Chen, “GreenCloud: A new
architecture for green data center,” in International conference industry session on
Autonomic computing and communications. ACM, 2009, pp. 29–38.
[197] H. Chen, M. Kesavan, K. Schwan, A. Gavrilovska, P. Kumar, and Y. Joshi,
“Spatially-aware optimization of energy consumption in consolidated data center
systems,” in ASME Pacific Rim Technical Conference and Exhibition on Pack-
aging and Integration of Electronic and Photonic Systems. American Society of
Mechanical Engineers, 2011, pp. 461–470.
[198] V. Anagnostopoulou, S. Biswas, A. Savage, R. Bianchini, T. Yang, and F. Chong,
“Energy conservation in datacenters through cluster memory management and
barely-alive memory servers,” in Workshop on Energy Efficient Design, 2009.
[199] A. Verma, G. Dasgupta, T. Nayak, P. De, and R. Kothari, “Server workload anal-
ysis for power minimization using consolidation,” in USENIX Annual technical
conference. USENIX Association, 2009, pp. 28–28.
[200] G. Da Costa, M. D. De Assuncao, J. Gelas, Y. Georgiou, L. Lefevre, A. Orgerie,
J. Pierson, O. Richard, and A. Sayah, “Multi-facet approach to reduce energy
consumption in clouds and grids: the green-net framework,” in International con-
ference on energy-efficient computing and networking. ACM, 2010, pp. 95–104.
[201] R. Nathuji, C. Isci, and E. Gorbatov, “Exploiting platform heterogeneity for power
efficient data centers,” in International Conference on Autonomic Computing.
IEEE, 2007, pp. 5–5.
[202] M. Zapater, J. Ayala, and J. Moya, “Leveraging heterogeneity for energy minimiza-
tion in data centers,” in IEEE/ACM International Symposium on Cluster, Cloud
and Grid Computing. IEEE, 2012.
[203] K. Le, R. Bianchini, M. Martonosi, and T. Nguyen, “Cost- and energy-aware load
distribution across data centers,” HotPower, pp. 1–5, 2009.
[204] J. Berral, I. Goiri, R. Nou, F. Julia, J. Guitart, R. Gavalda, and J. Torres, “Towards
energy-aware scheduling in data centers using machine learning,” in International
Conference on energy-Efficient Computing and Networking. ACM, 2010, pp. 215–
224.
[205] J. Kolodziej, S. Khan, and F. Xhafa, “Genetic algorithms for energy-aware schedul-
ing in computational grids,” in International Conference on P2P, Parallel, Grid,
Cloud and Internet Computing. IEEE, 2011, pp. 17–24.
[206] L. Keys, S. Rivoire, and J. Davis, “The search for energy-efficient building blocks for
the data center,” in International Conference on Computer Architecture. Springer,
2012, pp. 172–182.
[207] Prism Microsystems, “2010 state of virtualization security survey,” Tech. Rep., USA, April 2010.
[208] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, “DCell: A scalable and fault-tolerant network structure for data centers,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 4, pp. 75–86, 2008.
[209] R. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan,
V. Subramanya, and A. Vahdat, “PortLand: A scalable fault-tolerant layer 2 data
center network fabric,” in ACM SIGCOMM Computer Communication Review,
vol. 39, no. 4. ACM, 2009, pp. 39–50.
[210] virt-p2v and virt-v2v. [Online]. Available: http://libguestfs.org/virt-v2v/
[211] K. Bilal, S. Malik, O. Khalid, A. Hameed, E. Alvarez, V. Wijaysekara, R. Irfan,
S. Shrestha, D. Dwivedy, M. Ali et al., “A taxonomy and survey on green data
center networks,” Future Generation Computer Systems, vol. 36, pp. 189–208, 2013.
[212] K. Bilal, S. Khan, L. Zhang, H. Li, K. Hayat, S. Madani, N. Min-Allah, L. Wang,
D. Chen, M. Iqbal, C. Xu, and A. Zomaya, “Quantitative comparisons of the state-
of-the-art data center architectures,” Concurrency and Computation: Practice and
Experience, vol. 25, no. 12, pp. 1771–1783, 2013.
[213] X. Wang, Y. Yao, X. Wang, K. Lu, and Q. Cao, “CARPO: Correlation-aware power
optimization in data center networks,” in IEEE INFOCOM. IEEE, 2012.
[214] L. Shang, L. Peh, and N. Jha, “Dynamic voltage scaling with links for power
optimization of interconnection networks,” in International Symposium on High-
Performance Computer Architecture. IEEE, 2003.
[215] D. Halperin, S. Kandula, J. Padhye, P. Bahl, and D. Wetherall, “Augmenting data
center networks with multi-gigabit wireless links,” in ACM SIGCOMM Computer
Communication Review, vol. 41, no. 4. ACM, 2011, pp. 38–49.
[216] J. Shin, E. Sirer, H. Weatherspoon, and D. Kirovski, “On the feasibility of com-
pletely wireless datacenters,” IEEE/ACM Transactions on Networking, vol. 21,
no. 5, pp. 1666–1679, 2013.
[217] Y. Cui, H. Wang, X. Cheng, and B. Chen, “Wireless data center networking,”
IEEE Wireless Communications, vol. 18, no. 6, 2011.
[218] M. Gharbaoui, B. Martini, D. Adami, G. Antichi, S. Giordano, and P. Castoldi,
“On virtualization-aware traffic engineering in OpenFlow data center networks,” in
NOMS. IEEE, 2014.
[219] M. Jarschel and R. Pries, “An OpenFlow-based energy-efficient data center approach,” in Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. ACM, 2012, pp. 87–88.
[220] I. Goiri, K. Le, T. Nguyen, J. Guitart, J. Torres, and R. Bianchini, “GreenHadoop:
leveraging green energy in data-processing frameworks,” in European conference on
Computer Systems. ACM, 2012, pp. 57–70.
[221] X. Lin, Z. Meng, C. Xu, and M. Wang, “A practical performance model for Hadoop MapReduce,” in 2012 IEEE International Conference on Cluster Computing Workshops.
[222] E. Curry, S. Hasan, M. White, and H. Melvin, “An environmental chargeback
for data center and cloud computing consumers,” in International Conference on
Energy Efficient Data Centers. Springer, 2012, pp. 117–128.
[223] A. Souarez, “Charging customers for power usage in Microsoft data centers,” MSDN, 2008.
[224] Cloud Cruiser for Amazon Web Services. [Online]. Available: http://www.cloudcruiser.com/partners/amazon/
[225] Cisco, “Managing the real cost of on-demand enterprise cloud services with charge-
back models,” White Paper, C11-617390-00, Cisco, USA, August, 2010.
[226] V. Avelar, D. Azevedo, and A. French (Editors), “PUE: A comprehensive examination of the metric,” White Paper, 49, The Green Grid, USA, 2014.
[227] D. Azevedo, J. Cooley, M. Patterson, and M. Blackburn. (2011) Data center efficiency metrics: mPUE, partial PUE, ERE, DCcE. The Green Grid. [Online]. Available: http://www.thegreengrid.org/~/media/TechForumPresentations2011/Data Center Efficiency Metrics 2011.pdf
[228] P. Delforge. (2015) America’s data centers consuming and wasting growing
amounts of energy. The Natural Resources Defense Council. [Online]. Available:
http://www.nrdc.org/energy/data-center-efficiency-assessment.asp
[229] Toronto Hydro Corporation. [Online]. Available: http://www.torontohydro.com/
[230] PowerStream. [Online]. Available: http://www.powerstream.ca/
[231] Data Centre Incentive Program. [Online]. Available: http://www.cita.utoronto.ca/~dubinski/DCIP/DCIP SellSheet FINALcb.pdf
[232] M. Ghamkhari and H. Mohsenian-Rad, “Optimal integration of renewable energy
resources in data centers with behind-the-meter renewable generator,” in ICC.
IEEE, 2012.
[233] S. Ghamkhari, “Energy management and profit maximization of green data cen-
ters,” Master’s thesis, Electrical and Computer Engineering, Texas Tech University,
2012.
[234] Other World Computing. [Online]. Available: http://owc.net/about.php
[235] Baryonyx Corp. [Online]. Available: http://www.baryonyxcorp.com/
[236] Green House Data. [Online]. Available: http://www.greenhousedata.com/
[237] Wind-powered data centers. Data Center Knowledge. [Online]. Available:
http://www.datacenterknowledge.com/wind-powered-data-centers/
[238] M. Langberg and A. Sprintson, “On the hardness of approximating the network coding capacity,” in IEEE International Symposium on Information Theory (ISIT). IEEE, 2008, pp. 315–319.