
POWER EFFICIENT RESOURCE

ALLOCATION IN HIGH PERFORMANCE

COMPUTING SYSTEMS

By

Hameed Hussain

CIIT/FA09-PCS-003/ISB

PhD thesis

In

Computer Science

COMSATS Institute of Information Technology,

Islamabad-Pakistan

Fall, 2016


COMSATS Institute of Information Technology

Power Efficient Resource Allocation in High

Performance Computing Systems

A Thesis Presented to

COMSATS Institute of Information Technology, Islamabad

In partial fulfillment

of the requirement for the degree of

Ph.D. Computer Science

By

Hameed Hussain

CIIT/FA09-PCS-003/ISB

Fall, 2016


Power Efficient Resource Allocation in High

Performance Computing Systems


A Post Graduate Thesis submitted to the Department of Computer Science in partial fulfillment of the requirements for the award of the degree of Ph.D. in Computer Science.

Name: Hameed Hussain

Registration Number: CIIT/FA09-PCS-003/ISB

Supervisor

Dr. Nasro Min Allah

Associate Professor,

Department of Computer Science

COMSATS Institute of Information Technology (CIIT)

Islamabad Campus.

October, 2016


Certificate of Approval

This is to certify that the research work presented in this thesis, entitled “Power Efficient Resource Allocation in High Performance Computing Systems”, was conducted by Hameed Hussain, registration number CIIT/FA09-PCS-003/ISB, under the supervision of Dr. Nasro Min-Allah. No part of this thesis has been submitted anywhere else for any other degree. This thesis is submitted to the Department of Computer Science, COMSATS Institute of Information Technology, Islamabad, in partial fulfillment of the requirement for the degree of Doctor of Philosophy in the field of Computer Science.

Student Name: Hameed Hussain Signature: __________________

Examination Committee:

Prof. Dr. Bhawani Shankar Chowdhry
Dean, Faculty of Electrical, Electronics and Computer Engineering, MUET, Jamshoro, Pakistan

Prof. Dr. Malik Sikandar Hayat Khiyal
Department of Computer Science, Preston University, Islamabad, Pakistan

Dr. Nasro Min-Allah
Supervisor, Department of Computer Science, CIIT, Islamabad

Dr. Majid Iqbal Khan
HoD, Department of Computer Science, CIIT, Islamabad

Prof. Dr. Zulfiqar Habib
Chairperson, Computer Science, CIIT

Prof. Dr. Syed Asad Hussain
Dean, Faculty of Information Sciences and Technology, CIIT


Author’s Declaration

I, Hameed Hussain, reg. no. CIIT/FA09-PCS-003/ISB, hereby state that my PhD thesis titled “Power Efficient Resource Allocation in High Performance Computing Systems” is my own work and has not been submitted previously by me for taking any degree from this university, i.e., COMSATS Institute of Information Technology, or anywhere else in the country/world.

If at any time my statement is found to be incorrect, even after I graduate, the University has the right to withdraw my PhD degree.

Date: _____________________ Signature of the student

(Thesis submission)

Hameed Hussain

CIIT/FA09-PCS-003/ISB


Plagiarism Undertaking

I solemnly declare that the research work presented in this thesis, titled “Power Efficient Resource Allocation in High Performance Computing Systems”, is solely my research work with no significant contribution from any other person. Small contributions/help, wherever taken, have been duly acknowledged, and the complete thesis has been written by me.

I understand the zero tolerance policy of HEC and COMSATS Institute of Information Technology towards plagiarism. Therefore, I, as an author of the above titled thesis, declare that no portion of my thesis has been plagiarized and any material used as reference is properly referred/cited.

I undertake that if I am found guilty of any formal plagiarism in the above titled thesis, even after the award of the PhD degree, the University reserves the right to withdraw/revoke my PhD degree, and HEC and the University have the right to publish my name on the HEC/university website on which the names of students who submitted plagiarized theses are placed.

Date: _____________________ Signature of the student

(Thesis submission)

Hameed Hussain CIIT/FA09-PCS-003/ISB


Certificate

It is certified that Mr. Hameed Hussain, reg. no. CIIT/FA09-PCS-003/ISB, has carried out all the work related to this thesis under my supervision at the Department of Computer Science, COMSATS Institute of Information Technology, Islamabad, and the work fulfills the requirements for the award of the PhD degree.

Date: __________________ Supervisor:

(Thesis submission)

_________________________

Dr. Nasro Min Allah

Associate Professor,

Department of Computer Science

CIIT, Islamabad

Head of Department:

______________________________

Dr. Majid Iqbal Khan

Associate Professor,

Department of Computer Science,

CIIT, Islamabad


DEDICATION

To my parents, who are like cool shade in the noontide of my life and who sacrificed their today for my tomorrow; particularly to my late mother, whose hands grew tired of praying for my success; and to those who pray for me and encouraged me throughout my educational career.


ACKNOWLEDGEMENTS

I offer my heartiest “Drood-o-Salam” to the holy Prophet Muhammad (peace be upon him). I am grateful to Almighty Allah, who is merciful and beneficent, and who enabled me to work on this research successfully. Accomplishment of a research thesis requires the help of many people who steer, guide, encourage, and assist you. Many people guided and supported me; they were always there to help me out in times of need. First, I would like to express my sincere gratitude to my supervisors, Dr. Nasro Min Allah and Dr. Manzoor Illahi Tamimy, associate professors at COMSATS Institute of Information Technology (CIIT) Islamabad, for their esteemed supervision, encouragement, and guidance towards the successful completion of this research work. Secondly, I am grateful to all faculty members of the Computer Science Department at CIIT Islamabad for their timely and unconditional help. They always encouraged me, helped me in understanding my research problems, and guided me in coping with the issues and problems faced during this research work. I am also thankful to my friends, especially Muhammad Bilal Qureshi, who encouraged me to complete this work. I am wholeheartedly grateful to my parents, brothers, sisters, and wife for their gracious, unconditional support, prayers, and encouragement throughout my educational career. I am thankful to Almighty Allah, who blessed me with a daughter, Anfal, and two sons, Muhammad Bilal and Muhammad Talha. The love of my children encouraged me and added new motivation to my work, which was the real source of energy for giving the final touch to the thesis.

Hameed Hussain,

CIIT/FA09-PCS-003/ISB


ABSTRACT

Power Efficient Resource Allocation in High Performance Computing Systems

Efficient resource allocation is a fundamental requirement in High Performance Computing (HPC) systems. Many projects dedicated to large-scale distributed computing systems have designed and developed resource allocation mechanisms with a variety of architectures and services. Resource allocation mechanisms and strategies play a vital role in the performance improvement of all high performance computing classifications. Therefore, a comprehensive discussion of widely used resource allocation strategies deployed in distributed high performance computing environments is required. The author classifies the distributed high performance computing systems into three broad categories, namely: (a) cluster, (b) grid, and (c) cloud systems, and defines the characteristics of each class by extracting sets of common attributes. All of the aforementioned systems are cataloged into pure software and hybrid/hardware solutions. The system classification is used to identify the approaches followed by the implementation of existing resource allocation strategies that are widely presented in the literature.

More computational power is offered by high performance computing systems to cope with CPU intensive applications. However, this facility comes at the price of more energy consumption and eventually higher heat dissipation. As a remedy, these issues are being countered by adjusting system speed on the fly so that application deadlines are respected and the overall system energy consumption is reduced. In addition, the current state of the art of high performance computing, particularly multi-core technology, opens further research opportunities for energy reduction through power efficient scheduling. However, the multi-core front is relatively unexplored from the perspective of task scheduling. To the best of our knowledge, very little is known as of yet about integrating a power efficiency component into real-time scheduling theory that is tailored for high performance computing, particularly multi-core platforms. In these efforts, the author first proposes a technique to find the least feasible speed to schedule individual tasks. The proposed technique is experimentally evaluated, and the results show the supremacy of this approach over the existing counterpart called first feasible speed. However, this solution comes at the cost of delayed response time. The experimental results are in accordance with the mathematical formulation established in this work. To minimize power consumption, the author made another attempt by applying a genetic algorithm on first feasible speed. The newly proposed approach is termed genetic algorithm with first feasible speed. The author compares the results obtained through the aforementioned approach with existing techniques. It is worth mentioning that the proposed technique outperforms first feasible speed with respect to energy consumption and least feasible speed with respect to response time.

Load balancing is also vital for the efficient and equal utilization of computing units (systems or cores). To balance load among computing units, the author applies lightest-task migration (task shifting) and task splitting mechanisms in a multi-core environment. In task shifting, a task having minimum load on a highly utilized computing unit is fully transferred to a lowly utilized computing unit. In task splitting, the load of a task from a highly utilized computing unit is shared between that computing unit and a lowly utilized computing unit. It is concluded from the given results that the task splitting mechanism fully balances load but is more time consuming as compared to the task shifting strategy.

Keywords: High performance computing, Real-time systems, Scheduling, Resource allocation, Resource management, Genetic algorithm, Task migration, Task splitting.


TABLE OF CONTENTS

Chapter 1: Introduction
1.1 Introduction
1.2 Motivation
1.3 Problem Statement
1.4 Research Issues
1.5 Contributions of the Thesis
1.6 Organization of the Thesis
1.7 Summary

Chapter 2: Resource Allocation in High Performance Distributed Computing Systems
2.1 Introduction
2.2 Overview of Distributed HPC Systems
2.2.1 Distributed HPC Systems Classes
2.2.2 Cluster Computer Systems: Features and Requirements
2.2.3 Grid Computer Systems: Features and Requirements
2.2.4 Cloud Computing Systems: Features and Requirements
2.3 Comparison and Survey of the Existing HPC Solutions
2.3.1 Cluster Computing Systems
2.3.2 Grid Computing Systems
2.3.3 Cloud Computing Systems
2.4 Classification of Systems
2.4.1 Software Only Solutions
2.4.2 Hardware/Hybrid Only Solutions
2.5 Conclusion of the Chapter

Chapter 3: Power Efficient Resource Allocation Using LFS
3.1 Introduction
3.2 System Model and Background
3.3 Lowest Speed Calculations
3.4 Experimental Analysis
3.4.1 Determining the Lowest Speed
3.4.2 Energy Savings
3.5 Task Partitioning in Multi-core Systems
3.6 Task Mapping on Cores
3.7 Conclusion of the Chapter

Chapter 4: Power Efficient Resource Allocation Using GA-FFS
4.1 Introduction
4.2 Proposed Work
4.2.1 Main Drivers of Genetic Algorithm
4.2.1.1 Cross Over
4.2.1.2 Mutation
4.2.2 Feasibility Checking Through GA-FFS Approach
4.3 Experimental Results and Analysis
4.4 Conclusion of the Chapter

Chapter 5: Resource Allocation Using Load Balancing Mechanisms
5.1 Introduction
5.2 Load Balancing Mechanisms
5.2.1 Task Migration or Task Shifting
5.2.2 Task Splitting
5.2.3 Explanation Through an Example
5.3 Results and Discussions
5.4 Conclusion of the Chapter

Chapter 6: Conclusion, Recommendations and Future Directions
6.1 Introduction
6.2 Conclusion
6.3 Recommendations
6.4 Future Directions


LIST OF FIGURES

Figure 2.1: HPC systems categories and attributes
Figure 2.2: A cluster computing system architecture
Figure 2.3: A model of grid computing system
Figure 2.4: A layered model of cloud computing system
Figure 2.5: (a) Single task job and (b) multiple task job
Figure 2.6: Resource management (a) centralized (b) decentralized
Figure 2.7: Taxonomy of resource allocation policies
Figure 3.1: Gantt chart for τ1(1.30, 3), τ2(1.19, 5), and τ3(1.19, 10)
Figure 3.2: Gantt chart for τ1(1.57, 3), τ2(1.42, 5), and τ3(1.42, 10)
Figure 3.3: Effect of utilization on system speed
Figure 3.4: Power consumption of Crusoe processor at respective voltage
Figure 3.5: Normalized energy consumptions
Figure 3.6: LFS and FFS comparison based on required execution time
Figure 3.7: Load distribution on system with 8 cores
Figure 3.8: Load distribution on system with 12 cores
Figure 4.1: Flow chart of GA-FFS
Figure 4.2: Process of tournament selection
Figure 4.3: Cross-over process
Figure 4.4: Mutation process
Figure 4.5: Task set size against required speed (LFS, FFS, GA-FFS)
Figure 4.6: Energy consumption against task set size for LFS, FFS and GA-FFS
Figure 4.7: Execution time against task set size for LFS, FFS and GA-FFS
Figure 5.1: Load balancing mechanisms (task shifting and task splitting)
Figure 5.2: Number of tasks on cores before load balancing
Figure 5.3: Cores utilization before load balancing
Figure 5.4: Number of tasks on cores after task shifting
Figure 5.5: Cores utilization after task shifting
Figure 5.6: Cores utilization after task splitting


LIST OF TABLES

Table 2.1: Commonality between cluster, grid, and cloud systems
Table 2.2: Survey of the existing HPC systems
Table 2.3: Comparison of cluster computing systems
Table 2.4: Comparison of grid computing systems
Table 2.5: Comparison of cloud computing systems
Table 2.6: Classification of grid, cloud and cluster systems
Table 3.1: Operational levels and the respective speed ranges
Table 5.1: Overall simulation results


LIST OF ABBREVIATIONS

Amazon EC2: Amazon Elastic Compute Cloud

AMIs: Amazon Machine Images

AOP: Application Oriented Policy

BNS: Broker Name Service

CC: Cluster Controller

CGs: Computational Grids

CIIT: COMSATS Institute of Information Technology

CMOS: Complementary Metal Oxide Semiconductor

CPU: Central Processing Unit

CSPs: Cloud Service Providers

DAG: Directed Acyclic Graph

DM: Deadline Monotonic

DQS: Distributed Queuing System

DVS: Dynamic Voltage Scaling

FCFS: First Come First Serve

FFS: First Feasible Speed

FORTRAN: Formula Translation

FS: File System

GA: Genetic Algorithm

GAE: Google Application Engine

GA-FFS: Genetic Algorithm with First Feasible Speed

GENI: Global Environment for Network Innovations

GHS: Grid Harvest Service

GNQS: Generic Network Queuing System

GQoSM: Grid Quality of Services Management

GRACE: Grid Architecture for Computational Economy

GRB: Grid Resource Broker

GRIP: GRid Information Protocol

GRRP: GRid Registration Protocol

HB: Hyperbolic Bound

H-FSC: Hierarchical Fair Service Curve

HP: Hewlett Packard

HPC: High Performance Computing

HTTP: Hypertext Transfer Protocol

IaaS: Infrastructure as a Service

IBM: International Business Machines

ICDIM: International Conference on Digital Information Management

IP: Internet Protocol

IT: Information Technology

JPDC: Journal of Parallel and Distributed Computing

JSON: JavaScript Object Notation

LFS: Least Feasible Speed

LL-bound: Liu and Layland bound

LSF: Load Sharing Facility

MATLAB: Matrix Laboratory


MIPS: Millions of Instructions Per Second

MOL: Meta Computing Online

MPI: Message Passing Interface

MTTF: Mean Time To Failure

NC: Node Controller

Ninf: Network Infrastructure

NIST: National Institute of Standards and Technology

NWS: Network Weather Service

OGSA: Open Grid Service Architecture

OpenSSI: Open Single System Image

ORB: Object Request Broker

OS: Operating System

PaaS: Platform as a Service

PARCO: Parallel Computing Journal

PBS: Portable Batch System

PC: Personal Computer

PDAs: Personal Digital Assistants

PUNCH: Purdue University Network Computing Hub

PVM: Parallel Virtual Machine

QoS: Quality of Service

RAS: Reliability, Availability and Serviceability

REST: Representational State Transfer

RM: Rate Monotonic

RMS: Resource Management System

S3: Simple Storage Service

SaaS: Software as a Service

SC: Sufficient Conditions

SLAs: Service Level Agreements

SLURM: Simple Linux Utility for Resource Management

SMP: Symmetric Multi-Processing

SOAP: Simple Object Access Protocol

SP: System Provisioning

SSL: Secure Sockets Layer

STC: Storage Controller

TAO: The Ace ORB

TCP: Transmission Control Protocol

TSWJ: The Scientific World Journal

URL: Uniform Resource Locator

VGrADS: Virtual Grid Application Development Software

VLAN: Virtual Local Area Network

VM: Virtual Machine

WAN: Wide Area Network

Web API: Web Application Programming Interface

XML: Extensible Markup Language


Chapter 1

Introduction


1.1 Introduction

Improving computational power at minimum power (energy) consumption is the demand of the day. Computational power is enhanced by using HPC systems. HPC systems are categorized into distributed and non-distributed [1, 2, 3]. By distribution, we mean processors on different boards. Cluster, grid, and cloud computing systems fall under the umbrella of distributed HPC systems [1], while multi-core [4] technology comes under the category of non-distributed HPC systems. Clustering, or cluster computing, uses multiple storage and processing devices and interconnections between them to form a single system image to the outside world [1, 5]. The prime goal of a cluster system is integrating software, hardware, and network resources for availability, load balancing, and performance improvement [1, 5, 6, 7]. The grid computing concept is based on using the Internet as a medium [1]. In grid computing, powerful computing resources are connected through this medium for widespread availability [8]. Grid systems, as opposed to clusters, have different administrative domains and user privileges [1, 9, 10], and are heterogeneous, loosely coupled, and geographically spread [1, 11, 12]. Problem solving and resource sharing are the primal motivations behind grid computing. For more detail about grid and its types, see ref [13]. Services in cloud computing are provided through the Internet. The services are dynamically scalable and resources are virtualized over the Internet [1, 14, 15, 16, 17]. In cloud computing, the services are software, platform, and infrastructure, and the terminologies Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS) are used. For more detail and a profound comparison of cluster, grid, and cloud, see ref [1].

The distributed computing paradigm endeavors to tie together the power of a large number of resources distributed across a network. Each user has requirements that are shared in the network architecture through a proper communication channel [18]. The distributed computing paradigm is used for three major reasons. First, the nature of distributed applications suggests the use of a communication network that connects several computers. Such networks are necessary for producing data that are required for the execution of tasks on remote resources. Second, most parallel applications have multiple processes that run concurrently on many nodes communicating over a high-speed interconnect. The use of high performance distributed systems for parallel applications is beneficial as compared to a single CPU machine for practical reasons: the availability of services distributed in a wide network is low-cost and makes the whole system scalable and adaptable to achieve the desired level of performance efficiency [14]. Third, the reliability of a distributed system is higher than that of a monolithic single processor machine. A single failure of one network node in a distributed environment does not stop the whole process, as opposed to a single CPU resource. Some techniques for achieving reliability in a distributed environment are checkpointing and replication [14].

The birth of multi-core systems has significantly advanced the existing technologies in the domain of computer architecture and HPCs. However, this advantage presents the research community with enormous challenges, such as the efficient handling of thermal dissipation and the lack of mature scheduling techniques.

Normally, all the cores of a chip operate in the same clock domain, clock frequency, and operational voltage [19]. However, there exist systems in which the cores do not operate at the same frequency. Therefore, maintaining performance symmetry among asymmetrically operating cores is one of the most critical issues that researchers are dealing with today [20, 21]. There are two possible solutions for the abovementioned issues: (i) add dynamic voltage circuitry per core (a hardware solution), or (ii) schedule tasks among cores judiciously to enable all the cores to operate on the same clock frequency (a software solution). The former compensation strategy exhibits power leakage at higher frequencies and undermines thermal throttling [20]. Being a promising alternative, the latter solution is relatively unexplored from the point of view of scheduling. Considering this gap, we partition a given workload among the cores with the intention that all the cores operate on the same clock frequency for maximum energy savings.

Newer processors provide an interface to dynamically adjust the voltage (or speed) for optimized power consumption. This voltage (speed) adjustment at run time is termed Dynamic Voltage Scaling (DVS), which is an effective methodology for the reduction of core power consumption. Dynamic clock and voltage adjustments represent the cutting edge of power reduction capabilities in Complementary Metal Oxide Semiconductor (CMOS) circuitry. The relation between frequency and voltage/power provides the foundation for dynamic voltage scaling in modern processors [20, 22, 23, 24]. Theoretically, an ideal processor would be one that supplies continuous voltage levels. However, using continuously variable voltages is infeasible because of the switching overhead to support several operational levels. Therefore, the latest processors are capable of supporting a fixed number of discrete-level speeds between predefined minimum and maximum levels. It has been reported in [25] that the energy–speed curve is convex in nature. Therefore, according to Jensen's inequality [26, 27, 28], as long as the deadline constraints are fulfilled, it is more energy efficient to execute tasks at a constant speed than at a variable speed for each of the individual tasks. We further extend this result by exploring the possibility of determining a uniform system speed for all the cores by considering a processor that supports a large number of discrete energy–voltage levels.
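To make the convexity argument concrete, the following worked inequality is a minimal sketch assuming the cubic power model P(s) ∝ s³ that is commonly used for CMOS circuits (the cubic exponent is only an illustrative assumption here, not a statement of the power model used later in the thesis). It shows why a constant speed never consumes more energy than a variable-speed schedule doing the same work in the same time:

```latex
% Work over an interval of length T, run at speeds s_1 and s_2 for
% fractions \alpha and (1-\alpha) of T, versus the constant speed
% \bar{s} = \alpha s_1 + (1-\alpha) s_2 that completes the same cycles.
E_{\mathrm{var}}   = \alpha T\, s_1^{3} + (1-\alpha) T\, s_2^{3}, \qquad
E_{\mathrm{const}} = T\, \bar{s}^{\,3}.
% Convexity of s \mapsto s^{3} and Jensen's inequality give
\bigl(\alpha s_1 + (1-\alpha) s_2\bigr)^{3}
  \le \alpha s_1^{3} + (1-\alpha) s_2^{3}
\quad\Longrightarrow\quad E_{\mathrm{const}} \le E_{\mathrm{var}}.
```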

In real-time systems, tasks are scheduled based on some predefined criteria, such as activation rates, deadlines, and priorities [29, 30, 31, 32, 33]. The higher the priority of a task, the more attention is devoted to the task when a scheduling decision is to be made. Real-time systems are usually not utilized to the maximum extent; therefore, these systems are a promising venue for applying DVS methodologies and DVS-enabled scheduling techniques. Applying DVS techniques requires careful consideration of task scheduling, and a number of results are available (primarily for uni-processor systems) [19, 27, 28, 34, 35, 36, 37, 38, 39].

Continuous speed levels are normally assumed to obtain optimality. However, the aforesaid is inapplicable to practical systems that have processors with discrete voltage regulators [40, 41]. Manufacturers are introducing processors that will operate on more discrete levels than what we see today. For instance, the new Foxon technology is expected to enable Intel servers to operate on as many as 64 speed grades [40]. Therefore, an accurate model for reducing the energy consumption of the latest systems must capture the discrete rather than the continuous nature of the available speed scaling [25, 40, 41, 42]. However, the work we present here can easily be extended to systems that may operate on a continuous speed spectrum.

The most commonly used policy to schedule real-time tasks is the “priority driven” policy, which can be classified into the following two types: (i) fixed priority and (ii) dynamic priority [43]. A fixed-priority algorithm assigns a fixed priority to all jobs in each task, which should be different from the priorities assigned to jobs generated by other tasks within the system. In contrast, dynamic-priority scheduling algorithms place no restrictions on the manner in which priorities are assigned to individual jobs. Although dynamic algorithms are considered better theoretically, they become unpredictable when transient overload occurs [44]. Therefore, in this work, we only consider fixed-priority scheduling due to its applicability, reliability, and simplicity [29, 32, 33, 45, 46, 47].

The problem of scheduling periodic tasks under a fixed-priority scheme was first addressed by Liu and Layland [29] in 1973 with simplified assumptions. They derived the optimal static priority scheduling algorithm for the implicit-deadline model (when deadlines coincide with respective periods), termed the RM algorithm. The RM algorithm assigns static priorities based on the task activation rates (periods) such that for any two tasks τi and τj, priority(τi) > priority(τj) if and only if period(τi) < period(τj), wherein ties are broken arbitrarily. For a constrained-deadline system, where deadlines are not greater than periods, an optimal priority ordering has been reported in [48], termed Deadline Monotonic (DM) scheduling, wherein the assigned priorities are inversely proportional to the relative deadlines. The Rate Monotonic (RM) and DM methodologies are identical when the relative deadlines of tasks are proportional to their periods. In the remainder of this work, a task model refers to a constrained-deadline system, and both RM and DM will be used interchangeably to align with the terminologies used in the literature.

Scheduling policies developed for symmetric multiprocessors may also be applicable to the multi-core counterpart. Recently, the fixed-priority scheduling theory for the multi-core environment was studied in [34]. We extend the abovementioned work to further explore the necessary and sufficient condition [49] of the RM paradigm pertaining to multi-core systems. In particular, more interesting results are revealed for multi-core systems where all the cores operate at the same clock frequency [50]. Once the speed for a generic core i is determined, the average system speed suitable for all the cores is calculated. However, this average speed might potentially make the task set unschedulable on some cores. In this work, we address this anomaly to maintain system feasibility by shifting tasks from a heavily utilized core to an underutilized core such that all the cores process the same workload and the task set remains feasible at uniform system speed.

The genetic algorithm (GA) [51] is an optimization algorithm. It is influenced by the notion of Darwin's theory [52] of “survival of the fittest” [53]. The algorithm retains the fittest genes. The optimization process of GA is such that, initially, offspring fitness values are calculated and some offspring are selected using a selection method. The selection method can be random, roulette wheel, or tournament. The selected offspring then pass through crossover [51] and mutation [51] phases. Finally, the fitness values of the new offspring are calculated. After crossover and mutation, only those genes whose new fitness value is better than the old fitness value are retained in the new population, and thus the process of optimization takes place.
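The loop just described can be sketched as follows (an illustrative toy, not the GA-FFS operators of Chapter 4: the bit-string representation, rates, and objective are all invented here):

```python
import random

def genetic_algorithm(fitness, length=16, pop_size=20, generations=50,
                      cx_rate=0.8, mut_rate=0.05):
    """Maximize `fitness` over fixed-length bit strings using tournament
    selection, one-point crossover, and bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            # Tournament selection: the fitter of two random candidates is picked.
            p1 = max(random.sample(pop, 2), key=fitness)
            p2 = max(random.sample(pop, 2), key=fitness)
            c1, c2 = p1[:], p2[:]
            if random.random() < cx_rate:            # one-point crossover
                cut = random.randrange(1, length)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):                   # bit-flip mutation
                for i in range(length):
                    if random.random() < mut_rate:
                        child[i] ^= 1
            # Retain an offspring only if its new fitness beats the parent's.
            new_pop.append(max(c1, p1, key=fitness))
            new_pop.append(max(c2, p2, key=fitness))
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)

best = genetic_algorithm(fitness=sum)   # toy objective: maximize the number of 1 bits
print(best, sum(best))
```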

The overall performance of an HPC system depends on resource management and load balancing among all computing units. Efficient resource management and load balancing is a key and fundamental requirement for the success of any HPC environment, such as cluster, grid, cloud, and multi-core systems [54]. For load balancing among computing units, the author focuses on only one dimension of HPC systems, namely multi-core, and applies load balancing strategies such as task migration and task splitting; a rough sketch of the task-migration idea follows. The ideas behind the aforementioned strategies can easily be extended to distributed HPC systems (cluster, grid, and cloud).
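The sketch below is hypothetical (invented helper and data, not the thesis's procedure, which is developed in Chapters 3 and 5): the lightest task on the most loaded core is migrated to the least loaded core for as long as doing so narrows the imbalance.

```python
def shift_lightest_tasks(cores):
    """cores: list of per-core task-utilization lists (all utilizations > 0).
    Repeatedly migrate the lightest task from the most loaded core to the
    least loaded core while the donor stays at least as loaded as the receiver."""
    while True:
        loads = [sum(c) for c in cores]
        hi = loads.index(max(loads))
        lo = loads.index(min(loads))
        if hi == lo or not cores[hi]:
            return cores
        w = min(cores[hi])                     # lightest task on the busiest core
        if loads[hi] - w < loads[lo] + w:      # a move would flip the imbalance
            return cores
        cores[hi].remove(w)
        cores[lo].append(w)

cores = [[0.30, 0.25, 0.20], [0.10], [0.15, 0.05]]     # made-up utilizations
balanced = shift_lightest_tasks(cores)
print(balanced, [round(sum(c), 2) for c in balanced])  # loads: 0.55, 0.30, 0.20
```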

1.2 Motivation

Normally, energy consumption is directly proportional to population, i.e., energy consumption increases with an increase in population. As the use of computer technology increases, the power (energy) consumption also increases. Less efficient computers consume more energy, which not only wastes precious energy but also increases pollution [55]. Some of the pollution due to the growth of computer technology comes in the form of toxic materials and carbon dioxide from the power plants used for the production of computers [55]. The foremost motivation behind this research is the reduction of energy consumption, which results in saving money and energy for further usage and also serves the cause of green computing.

Power efficient resource allocation in HPC systems is important for several reasons [195]. Firstly, the costs of electricity for cooling and powering resources are going beyond the actual purchasing cost of the resources. Secondly, the increase in energy usage and its associated carbon emission have provoked environmental concerns. And finally, the increase in energy usage and heat dissipation has negative impacts on the reliability, density, and scalability of HPC systems [196].

1.3 Problem Statement

Resource allocation mechanisms play a vital role in the performance improvement of HPC systems. Therefore, a comprehensive discussion of widely used resource allocation strategies in distributed HPC systems is required. Moreover, maintaining system timing constraints on HPC is a challenge and is getting attention from the research community these days. This research work addresses the problem of how power (energy) efficiency is obtained by using task scheduling to adjust system speed on the fly in a multi-core environment, enabling a uniform system speed. This work also deals with the problem of load balancing among cores.

1.4 Research Issues

The key issue addressed in Chapter 2 at the abstract level is “resource allocation in high performance distributed computing systems” [1]. There are three broad classes of distributed HPC systems: (a) cluster, (b) grid, and (c) cloud. Besides other factors, the performances of the abovementioned classes are directly related to the resource allocation mechanisms used in the system. Therefore, in the said perspective, a complete analysis of the resource allocation mechanisms used in the HPC classes is required. The features of the HPC categories (cluster, grid, and cloud) are conceptually similar [56]. Therefore, efforts are required to distinguish each of the categories by selecting relevant distinct features for all, and to catalog the systems into pure software and hybrid/hardware HPC solutions. The author believes that a comprehensive analysis of leading research and commercial projects in the HPC domain can provide readers with an understanding of the essential concepts of the evolution of resource allocation mechanisms in HPC systems. Moreover, the solution of the aforementioned research issues will help individuals and researchers identify the important and outstanding issues for further investigation.

Conceptually, the research issue addressed in Chapter 3 and Chapter 4 is “how power (energy) efficiency is obtained by adjusting system speed in a multi-core environment” [2, 4]. As a known phenomenon, more computational power is offered by current real-time systems to cope with CPU intensive applications. However, this facility comes at the price of more energy consumption and eventually higher heat dissipation. As a remedy, these issues are being countered by adjusting system speed on the fly so that application deadlines are met and the overall system energy consumption is reduced. In addition, the current state of the art of multi-core technology opens further research opportunities for energy reduction through power efficient scheduling. However, the multi-core front is relatively unexplored from the perspective of task scheduling. To the best of the author's knowledge, very little is known as of yet about integrating a power efficiency component into real-time scheduling theory that is tailored for multi-core platforms. In Chapter 3, the issue is addressed through a novel approach called LFS, while Chapter 4 addresses the issue by adding GA to an existing approach, FFS, and terms the new approach GA-FFS.
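To give a feel for how discrete speed grades interact with a schedulability test (a hypothetical sketch in the spirit of first feasible speed; the actual FFS and LFS procedures, and the speed grades they use, are developed in Chapters 3 and 4), a first-feasible-speed style scan might look like:

```python
def ll_feasible(tasks):
    """Sufficient (not exact) RM test via the Liu and Layland bound."""
    n = len(tasks)
    return sum(c / t for c, t in tasks) <= n * (2 ** (1.0 / n) - 1)

def first_feasible_speed(tasks, speeds, test=ll_feasible):
    """Scan discrete speed grades from slowest to fastest and return the first
    grade at which the scaled task set passes the schedulability test.
    tasks: [(C, T)] with C measured at the maximum (normalized) speed 1.0."""
    for s in sorted(speeds):
        scaled = [(c / s, t) for c, t in tasks]   # execution times stretch as speed drops
        if test(scaled):
            return s
    return None                                   # infeasible even at maximum speed

tasks = [(0.5, 3), (0.7, 5), (1.0, 10)]           # made-up (C, T) pairs
speeds = [0.25, 0.50, 0.75, 1.00]                 # made-up discrete speed grades
print(first_feasible_speed(tasks, speeds))        # -> 0.75
```

As the abstract notes, LFS further reduces the selected speed below what such a first-hit scan settles on, at the cost of delayed response time.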

The application of the proposed power minimization approaches (LFS and GA-FFS) can lead to unbalanced utilization of computing units, while load balancing among computing units (cores or systems) plays a vital role in the overall performance of HPC. Efficient results may not be obtained unless a specific load is properly balanced among systems or cores in HPC. The primal issue addressed in Chapter 5 is load balancing among cores in a multi-core environment by using task shifting (migration) and task splitting mechanisms.

1.5 Contributions of the Thesis

The thesis contributes to the research community working in the field of HPC systems. The highlighted contributions of Chapter 2 of the thesis are as follows:

1. First, the said chapter analyzes and differentiates distributed HPC systems (cluster, grid, and cloud) based on some predefined common features.

2. Further, Chapter 2 deeply analyzes the resource allocation mechanisms of cluster, grid, and cloud systems [1].

3. In the said chapter, the author discusses the common features of each category and compares the resource allocation mechanisms of the systems based on the selected features [1].

4. Finally, Chapter 2 classifies the analyzed cluster, grid, and cloud systems as software-only and hybrid/hardware systems [1].

Chapter 3 of the thesis advances the current state of the art of scheduling theory as follows.

1. Identification of the lowest possible core speed. The work presented in Chapter 3 identifies the implicit disadvantage associated with the FFS approach that is often used in the literature. The author further investigates this issue and identifies properties and bounds that enable a procedure which can further reduce the core speed to the minimum possible level while ensuring that the task set remains RM schedulable [2].

2. Practical power savings with adjustable core speeds. Because of the practical limitations of the available DVS-enabled processors, the tasks are mapped using a finite number of discrete voltage levels. However, the work presented in Chapter 3 can also be applied equally to future generation processors that may support continuous voltage levels [2].

3. A simple but practical core load balancing procedure. In Chapter 3, the author presents the lightest-task-shift procedure to load balance the system cores. The motivation behind this mechanism is based on the observation that the lightest task (with the lowest utilization among all the tasks assigned to a core) is the task that decreases the core utilization by the minimum possible load when shifted (or migrated) from an overutilized core to an underutilized core [2].

4. Achieving uniform system power consumption and utilization. The focus of Chapter 3 was kept as general as possible to include heterogeneous system cores. The approach in Chapter 3 can fine-tune the system so that all the cores operate at the same clock rate and have equally proportionate core utilization. The abovementioned results in uniform system performance with predictable power consumption. The approach presented in Chapter 3 can be useful for designing applications that demand homogeneous performance over a heterogeneous system [2].

5. A novel energy efficient approach, LFS. The novel approach called LFS is the main contribution discussed in Chapter 3; it greatly enhances energy minimization in a multi-core environment.

Chapter 4 of the thesis contributes to the research community by improving the power consumption of an existing technique (FFS) [46] through applying a genetic algorithm on top of FFS; the new technique is termed GA-FFS [4].

Chapter 5 contributes to the research community by balancing load among cores through task-shifting and task-splitting strategies. The said chapter concludes that if response time is important, the task-shifting (migration) mechanism should be used; otherwise, task splitting should be used for full load balancing [3].

1.6 Organization of the Thesis

The rest of the thesis is organized as follows. In Chapter 2, a comprehensive “survey on resource allocation in high performance distributed computing systems” [1] is presented; cluster, grid, and cloud computing systems fall in the category of distributed HPC systems, and an analysis of existing distributed HPC systems based on predefined features is carried out. To cover the multi-core side of HPC systems, “power efficiency through least feasible speed in a multi-core environment” [2] is the focus of Chapter 3; its foremost aim is to allocate the computing resources in such a way that the overall speed and power consumption in a multi-core environment are optimized. In Chapter 4, GA is applied to FFS [4] to improve the speed and power efficiency of the existing technique, i.e., FFS; the new technique introduced in Chapter 4 is termed GA-FFS. The primal focus of Chapter 5 is “load balancing through task shifting and task splitting strategies in a multi-core environment” [3]. Finally, Chapter 6 concludes the thesis and discusses some recommendations and future directions.

1.7 Summary

Obtaining maximum system utilization with minimum energy consumption has been the central focus for researchers working in the domain of power efficient resource allocation. Encouraging results have been presented recently by applying energy aware scheduling techniques in HPC environments. The motivation is normally to extend battery life so that the device operates for longer, and also to reduce heat dissipation. On the other hand, more computational power in HPC comes at the price of more energy consumption. In HPC, the computational power increases in two dimensions: (i) borrow the computational power (other resources may also be borrowed) of other computers, i.e., cluster, grid, and cloud; (ii) increase the processing power of a single system by incorporating more cores on the same chip. Both dimensions have their merits and demerits. Careful resource allocation plays a vital role in an energy efficient computing facility. In multi-core systems, when the cores run asymmetrically, the main problem is load balancing among the cores. In such systems, it is more energy efficient to execute tasks at constant speed, as long as deadline constraints are fulfilled [27, 28, 57].

In this research work, the author first carried out a survey of resource allocation in distributed HPC (cluster, grid, and cloud) systems and later extended the work to the second dimension of HPC, i.e., resource allocation (scheduling) and load balancing in multi-core systems. The author shows that, for core speed calculation, the conventional FFS [46] is inferior to the author's proposed approaches, the LFS concept and GA-FFS, from the energy consumption point of view.


Chapter 2

Resource Allocation in High Performance Distributed Computing Systems


2.1 Introduction

The purpose of this chapter is to analyze the resource allocation mechanisms of three broad classes of HPC: (a) cluster, (b) grid, and (c) cloud. Besides other factors, the performances of the aforesaid classes are directly related to the resource allocation mechanisms used in the system. Therefore, in the said perspective, a complete analysis of the resource allocation mechanisms used in the HPC classes is required. In this chapter, the author presents a thorough analysis and the characteristics of the resource management and allocation strategies used in academic, industrial, and commercial systems.

The features of the HPC categories (cluster, grid, and cloud) are conceptually similar [56]. Therefore, an effort has been made to distinguish each of the categories by selecting relevant distinct features for all. The features are selected based on the information present in the resource allocation domain, acquired from a plethora of literature. The author believes that a comprehensive analysis of leading research and commercial projects in the HPC domain can provide readers with an understanding of the essential concepts of the evolution of resource allocation mechanisms in HPC systems. Moreover, this research will help individuals and researchers identify the important and outstanding issues for further investigation. The highlighted aspects of this chapter are as follows:

1. Analysis of the resource allocation mechanisms of cluster, grid, and cloud.

2. Identification of the common features of each category and comparison of the resource allocation mechanisms of the systems based on the selected features.

3. Classification of systems as software-only and hybrid/hardware systems.

In contrast to the other compact surveys and system taxonomies, such as [58, 59], the

focus of this study is to demonstrate the resource allocat ion mechanisms. Note that the

purpose of this survey is to demonstrate the resource allocation mechanisms and not the

performance analysis of the systems. Although, the performance can be analyzed based

on the resource allocation mechanism but this is not the scope of the chapter. The

purpose of this chapter is to aggregate and analyze the existing solutions for HPC under

the resource allocation policies. Moreover, an effort has been made to provide a broader

view of the resource allocation mechanisms and strategies by discussing systems of

different categories, such as obsolete systems (systems that were previously being used),

academic system (research projects proposed by institutes and universities), and

established systems (well-known working systems). The projects are compared on the


basis of the selected common features within the same category. For each category, the

characteristics discussed are specific and the list of features can be expanded further.

Finally, the systems are cataloged into pure software and hybrid/hardware HPC solutions.

The rest of the chapter is organized as follows. In Section 2.2, the author presents the HPC system classification and highlights the key terms and basic characteristics of each class. In Section 2.3, the author surveys the existing HPC research projects and commercial approaches of each class (cluster, grid, and cloud). The projects are cataloged into pure software and hybrid/hardware solutions at the end of the chapter.

Energy-efficient resource allocation in HPC systems is important for several reasons [195]. First, the electricity costs of powering and cooling resources are growing beyond the actual purchase cost of the resources. Second, the increase in energy usage and the associated carbon emissions have provoked environmental concerns. Finally, increased energy usage and heat dissipation have negative impacts on the reliability, density, and scalability of HPC system hardware [196].

In order to minimize implementation and computational complexity, the original problem (power-efficient resource allocation in distributed HPCs) with multiple constraints is decomposed into multiple optimizable sub-problems (power-efficient resource allocation in multi-core) with simple constraints. For the latter problem, two power-efficient resource allocation algorithms (LFS and GA-FFS) were proposed in Chapter 3 and Chapter 4, respectively. The simulation results of the said chapters reveal that the proposed algorithms outperform the existing counterpart (FFS) in terms of speed and hence power (energy). Therefore, the author's focus in this chapter is only to demonstrate the existing resource allocation mechanisms in distributed HPCs, which are given in the subsequent sections.

2.2 Overview of Distributed HPC Systems

This section discusses the three main categories of HPC systems, which are analyzed, evaluated, and compared based on a set of identified features. The author places cloud under the category of HPC because it is now possible to deploy an HPC cloud, such as Amazon EC2. Clusters with 50,000 cores have been run on Amazon EC2 for scientific applications [60]. Moreover, HPC workloads are usually massively large in scale and have to be run on many machines, which is naturally compatible with a cloud environment. The taxonomy representing the categories and the selected features used for comparison within the same category is shown in Figure 2.1.


Figure 2.1: HPC Systems categories and attributes [1]. (The figure is a taxonomy tree: HPC divides into cluster, grid, and cloud, each with its comparison features. Cluster features: Resource Allocation, Job Processing Type, QoS Attributes, Job Composition, Resource Allocation Control, Platform Support, Evaluation Method, Process Migration. Grid features: System Type, Scheduling Organization, Resource Description, Resource Allocation Policy, Breadth of Scope, Triggering Info, System Functionality. Cloud features: System Focus, Services, Virtualization, Dynamic Negotiation of QoS, Web APIs, User Access Interface, Value Added Services, Implementation Structure.)

Dong et al. [61] designed a taxonomy for the classification of scheduling algorithms in distributed systems. Moreover, Ref. [61] broadly categorizes scheduling algorithms as: (a) local vs. global, (b) static vs. dynamic, (c) optimal vs. suboptimal, (d) distributed vs. centralized, and (e) application-centric vs. resource-centric. Apart from the above classification, different variants of scheduling, such as conservative, aggressive, and no-reservation, can also be found in the literature [62, 63]. In conservative scheduling [38, 64], processes allocate the required resources before execution. Moreover, operations are delayed for serial execution of the tasks, which helps in process sequencing. The delay is also helpful when processes must be rejected. In aggressive (easy) scheduling [64], operations are immediately scheduled for execution to avoid delays. Moreover, the operations are reordered on the arrival of new operations. In some situations, when a task cannot be completed serially, its operations are rejected. In aggressive scheduling, the operations are not delayed but carry a rejection risk at later


stages. However, in conservative scheduling, the operations are not rejected but delayed. No reservation [65] is a dynamic scheduling technique in which resources are not reserved prior to execution but are allocated at run time. Without reservation, resources can be wasted because, if a resource is not available at request time, the process has to wait until the resource becomes available.
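To make the distinction concrete, the following minimal Python sketch (not taken from the thesis; job fields and times are illustrative assumptions) contrasts the aggressive (easy) rule, which reserves a start time only for the job at the head of the queue, with the conservative rule, which would reserve a start time for every queued job:

    # Hedged sketch: aggressive (EASY) vs. conservative admission of a
    # backfill candidate. All job fields and time values are assumed.
    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        runtime: int   # estimated runtime (time units)
        cores: int     # cores requested

    def can_backfill_easy(candidate, free_cores, head_start_time, now):
        # Aggressive rule: the candidate may jump the queue only if it fits
        # in the currently free cores AND finishes before the reserved start
        # time of the head-of-queue job (only the head holds a reservation).
        fits = candidate.cores <= free_cores
        finishes_in_time = now + candidate.runtime <= head_start_time
        return fits and finishes_in_time

    # A conservative scheduler would instead give *every* queued job a
    # reservation, so a candidate must not delay any of them, not just the head.

    print(can_backfill_easy(Job("small", 30, 2), free_cores=4,
                            head_start_time=100, now=60))   # True
    print(can_backfill_easy(Job("long", 80, 2), free_cores=4,
                            head_start_time=100, now=60))   # False

Under the aggressive rule, the second candidate would overrun the head job's reserved start time and is therefore not admitted, mirroring the rejection risk described above.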

2.2.1 Distributed HPC Systems Classes

This section focuses on distributed HPC systems. A brief introduction to each category of distributed HPC system (cluster, grid, and cloud) is given.

2.2.1.1 Cluster Computing Systems

Cluster computing, commonly referred to as clustering, is the use of multiple computers, multiple storage devices, and redundant interconnections to form a single highly available system [5]. Cluster computing can be used for high availability and load balancing. A common use of cluster computing is to provide load balancing for high-traffic websites. The concept of clustering was already present in DEC's VMS systems [66, 67]. IBM's Sysplex is a cluster-based approach for mainframe systems [68]. Microsoft, Sun Microsystems, and other leading hardware and software companies offer clustering packages for scalability and availability [69]. With increases in traffic or availability requirements, all or some parts of the cluster can be increased in size or number.

The goal of cluster computing is to design an efficient computing platform that uses a group of commodity computing resources integrated through hardware, networks, and software to improve the performance and availability of a single computing resource [6, 7]. One of the main ideas of cluster computing is to present a single system image to the outside world. Initially, cluster computing and HPC referred to the same type of computing systems. However, today's technology enables the extension of the cluster class by incorporating load balancing, parallel processing, multi-level system management, and scalability methodologies. Load balancing algorithms [70] are designed essentially to spread the load equally across processors and maximize utilization while minimizing total task execution time. To achieve the goals of cluster computing, the load-balancing mechanism should be fair in distributing the load across the processors [71]. The objective is to minimize the total execution and communication cost incurred by the task assignment, subject to the resource constraints, as the sketch below illustrates.
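As an illustration only (this is not the thesis's algorithm; the cost matrix, dependency map, and penalty value are assumptions), a greedy assignment that trades off execution cost against a communication penalty might look as follows in Python:

    # Hedged sketch: greedily place each task on the processor minimizing
    # current load + execution cost + a penalty for split dependencies.
    def assign_tasks(exec_cost, deps, comm_cost, n_procs):
        # exec_cost[t][p]: cost of task t on processor p
        # deps[t]: tasks that task t communicates with
        placement, load = {}, [0] * n_procs
        for t in range(len(exec_cost)):
            def total(p):
                comm = sum(comm_cost for d in deps.get(t, [])
                           if placement.get(d) not in (None, p))
                return load[p] + exec_cost[t][p] + comm
            best = min(range(n_procs), key=total)
            placement[t] = best
            load[best] += exec_cost[t][best]
        return placement, load

    exec_cost = [[4, 6], [5, 3], [2, 2]]   # 3 tasks, 2 processors
    deps = {2: [0, 1]}                     # task 2 talks to tasks 0 and 1
    print(assign_tasks(exec_cost, deps, comm_cost=2, n_procs=2))

A greedy pass like this is only a heuristic; optimal task assignment under such constraints is in general NP-hard, which is why practical load balancers settle for approximations.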


The extension of traditional clusters results in user-demand systems (providing SLA-based performance) that deliver the RAS needed for HPC applications. A modern cluster is made up of a set of commodity computers that are usually restricted to a single switch or a group of interconnected switches within a single VLAN [72]. Each compute node (computer) may have different architecture specifications (single-processor machine, symmetric multiprocessor system, etc.) and access to various types of storage devices. The underlying network is a dedicated network made up of a high-speed, low-latency system of switches with a single-level or multi-level hierarchical internal structure. In addition to executing compute-intensive applications, cluster systems are also used for replicated storage and backup servers that provide essential fault tolerance and reliability for critical parallel applications. Figure 2.2 depicts a cluster computing system that consists of: (a) management servers (responsible for controlling the system by taking care of system installation, monitoring, maintenance, and other tasks), (b) storage servers, disks, and backup (storage servers are connected to disks for storage, and the disks are connected to backup for data backup purposes; the storage server in Figure 2.2 provides shared file system access across the cluster), (c) user nodes (which system users log in to in order to run workloads on each cluster), (d) scheduler nodes (to which users submit their work to run the workload), and (e) computer nodes (which run the workloads).

Figure 2.2: A Cluster Computing System Architecture (showing the admin, management servers, storage servers with disk and backup, user nodes, scheduler nodes, and computer nodes)


2.2.1.2 Grid Computing Systems

The concept of grid computing is based on using the Internet as a medium for the widespread availability of powerful computing resources as low-cost commodity components [8]. A computational grid can be thought of as a distributed system of logically coupled local clusters with non-interactive workloads that involve a large number of files [73, 74]. By non-interactive we mean that an assigned workload is treated as a single task. Logical coupling of clusters means that the output of one cluster may become the input of another cluster, whereas within a cluster the workload is interactive. In contrast to conventional HPC (cluster) systems, grids account for different administrative domains with access policies, such as user privileges [9, 10]. Figure 2.3 depicts a general model of a grid computing system.

The motivation behind grid computing was resource sharing and problem solving in multi-institutional, dynamic virtual organizations, as depicted in Figure 2.3. A group of individuals and institutions forms a virtual organization, in which the individuals and the institutions define rules for resource sharing. Such rules specify, for example, what is shared, under what conditions, and with whom [75]. Moreover, the grid guarantees secure access through user identification. The aggregate throughput is more important than the price and overall performance of a grid system. What makes a grid different from conventional HPC systems, such as clusters, is that grids tend to be more loosely coupled, heterogeneous, and geographically dispersed [11, 12].

Figure 2.3: A Model of Grid Computing System


2.2.1.3 Cloud Computing Systems

Cloud computing describes a new model for IT services based on the Internet, and typically involves the provision of dynamically scalable and often virtualized resources over the Internet [14, 15, 16]. Moreover, cloud computing provides ease of access to remote computing sites using the Internet [76, 77]. Figure 2.4 shows a generic layered model of a cloud computing system. The user-level layer in Figure 2.4 is used by users to deal with the services provided by the cloud. Moreover, the top layer also uses the services provided by the lower layer to deliver the capabilities of SaaS [78]. The tools and environment required to create interfaces and applications on the cloud are provided by the user-level middleware layer. The runtime environment that enables cloud computing capabilities for the application services of the user-level middleware is provided by the core middleware layer. Moreover, computing capabilities are provided by this layer through the implementation of platform-level services [78]. The computing and processing power of cloud computing is aggregated through data centers. At the system-level layer, physical resources, such as storage servers and application servers, are available to power the data center [78].

Current cloud systems, such as Amazon EC2 [79], Eucalyptus [80], and LEAD [81], are based on VGrADS, sponsored by NIST [82]. The term "cloud" is a metaphor for the Internet. The metaphor is based on the cloud drawing used in the past to represent the telephone network [83] and later to depict the Internet in computer network diagrams as an abstraction of the underlying infrastructure [84]. Typical cloud computing providers deliver common business applications online that are accessed through web services, with the data and software stored on the servers. Clouds often appear as a single point of access for all the computing needs of consumers. Commercial offerings are generally expected to meet the QoS requirements of customers, and typically include SLAs [85].

The cloud model, as seen by the user, requires minimal management and interaction with IT administrators and resource providers. Conversely, the self-monitoring and healing of a cloud computing system require complex networking, storage, and intelligent system configuration. Self-monitoring is necessary for the automatic balancing of workloads across the physical network nodes to optimize the cost of system utilization. The failure of any individual physical software or hardware component of the cloud system is arbitrated swiftly for rapid system recovery.


Figure 2.4: A Layered Model of Cloud Computing System. (From top to bottom: the user level hosts cloud applications such as social computing, enterprise, ISV, scientific, and CDN applications; the user-level middleware provides environments and tools such as Web 2.0, mashups, scripting, and libraries; the core middleware provides the application hosting platform, including QoS negotiation, controls and policies, SLA management, accounting, and VM management and deployment; the system level comprises physical resources such as compute and storage.)

Table 2.1 depicts the common attributes among the HPC categories, such as size, network type, and coupling. Note that no numeric data is involved in Table 2.1; the comparisons are relative. For example, the size of a grid is large compared to a cluster. A grid network is usually private and spans a WAN, meaning that a grid spread over the Internet is owned by a single company. Foster et al. [58] use various perspectives, such as architecture, security model, business model, programming model, virtualization, data model, and compute model, to compare grids and clouds. Sadashiv et al. [59] compare three computing models (cluster, grid, and cloud) based on different characteristics, such as business model, SLA, virtualization, and reliability. A similar comparison can also be found in [86]. Another comparison among the three computing models can be found in [87].


Table 2.1: Commonality between cluster, grid, and cloud systems.

Feature                       | Cluster                              | Grid                | Cloud
Size                          | Small to medium                      | Large               | Small to large
Network type                  | Private, LAN                         | Private, WAN        | Public, WAN
Job management and scheduling | Centralized                          | Decentralized       | Both
Coupling                      | Tight                                | Loose/tight         | Loose
Resource reservation          | Pre-reserved                         | Pre-reserved        | On-demand
SLA constraint                | Strict                               | High                | High
Resource support              | Homogeneous and heterogeneous (GPU)  | Heterogeneous       | Heterogeneous
Virtualization                | Semi-virtualized                     | Semi-virtualized    | Completely virtualized
Security type                 | Medium                               | High                | Low
SOA and heterogeneity support | Not supported                        | Supported           | Supported
User interface                | Single system image                  | Diverse and dynamic | Single system image
Initial infrastructure cost   | Very high                            | High                | Low
Self service and elasticity   | No                                   | No                  | Yes
Administrative domain         | Single                               | Multi               | Both

2.2.2 Cluster Computer Systems: Features and Requirements

The overall performance of a cluster computing system depends on the features of the system. Cluster systems provide a mature solution for different types of computation- and data-intensive parallel applications. Among the many system-specific settings related to a particular problem, a set of basic generic cluster properties can be extracted as a common class for classical and modern cluster systems [88]. The extracted features shown in Figure 2.1 are defined in the following paragraphs.

2.2.2.1 Job Processing Type

Jobs submitted to a cluster system may be processed in parallel or sequentially. Jobs can be characterized as sequential or parallel based on the processing of the tasks involved in the job. A job that consists of parallel tasks has to execute concurrently on different processors, where each task starts at the same time. (Readers are encouraged to see [89] for more details on HPC job scheduling in clusters.) Usually, sequential jobs are executed on a single processor as a queue of independent tasks. Parallel applications are mapped to a multi-processor parallel machine and are executed simultaneously on the processors. The parallel processing mode speeds up whole-job execution and is the appropriate strategy for solving complex large-scale problems within a reasonable amount of time and cost. Many conventional market-based cluster


resource management systems support the sequential processing mode. However, a number of compute-intensive applications must be executed within a feasible deadline. Therefore, the parallel processing mode of cluster jobs is implemented in high-level cluster systems, such as SLURM [90], Enhanced MOSIX [14], and REXEC [15].

2.2.2.2 QoS Attributes

QoS attributes describe the basic service requirements requested by consumers that the service provider is required to deliver. The consumer represents a business user that generates service requests at a given rate, which need to be processed by the system. General attributes involved in QoS are: (a) time, (b) cost, (c) efficiency, (d) reliability, (e) fairness, (f) throughput, (g) availability, (h) maintainability, and (i) security. QoS metrics can be estimated using various measurement techniques. However, such techniques are difficult to use in solving a resource allocation problem with multiple constraints. The resource allocation problem with multiple constraints remains a critical problem in cluster computing. In some conventional cluster systems, such as REXEC [15], Cluster-on-Demand [76], and Libra SLA [77], the job deadline and user-defined budget constraints, such as (a) fairness, (b) time, and (c) cost, are considered; a sketch of such an admission check follows below. Market-based cluster RMSs still lack efficient support for reliability or trust. Recent applications that manipulate huge volumes of distributed data must be provided guaranteed QoS during network access. Providing best-effort services while ignoring the network mechanism is not enough to meet customer requirements. (Readers are encouraged to read [91, 92] for more understanding of QoS attributes in clusters.)
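As a purely illustrative sketch (the field names and rates are assumptions, not any particular system's API), an admission check over the two most common constraints, deadline and budget, in the spirit of systems such as Libra SLA, could look like this:

    # Hedged sketch: admit a job only if it can meet its deadline and its
    # estimated charge stays within the user-defined budget. Assumed fields.
    def admit(job, node_rate_per_sec, est_runtime_sec, now):
        finish_time = now + est_runtime_sec
        est_cost = node_rate_per_sec * est_runtime_sec
        return finish_time <= job["deadline"] and est_cost <= job["budget"]

    print(admit({"deadline": 500, "budget": 10.0}, 0.01, 400, now=50))  # True
    print(admit({"deadline": 500, "budget": 2.0},  0.01, 400, now=50))  # False (over budget)

Real systems must additionally handle estimation error, multiple competing jobs, and fairness, which is precisely where the multi-constraint difficulty noted above arises.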

2.2.2.3 Job Composition

Job composition depicts the number of tasks involved in a single job prescribed by the user. A single-task job is defined as a monolithic application, in which just a single task is specified, as depicted in Figure 2.5(a). Parallel (or multi-task) jobs are usually represented by a DAG, as shown in Figure 2.5(b). The nodes express the particular tasks partitioned from an application and the edges represent the inter-task communication [93] (please see [93] for more details). The tasks can be independent or dependent. Independent tasks can be executed simultaneously to minimize the processing time. Dependent tasks are cumbersome and must be processed in a pre-defined order to ensure that all dependencies are satisfied; a small ordering sketch is given after Figure 2.5. Market-based cluster RMSs must support all


three types of job composition, namely: (a) single-task, (b) independent multiple-task, and (c) dependent multiple-task [93].

Figure 2.5: a) Single Task Job and b) Multiple Task Job (Job A consists of a single task; Job B consists of tasks 1 through n linked by dependencies)
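For the dependent multiple-task case, a valid release order must respect every edge of the DAG. The following small Python sketch (the job layout is hypothetical, loosely following Job B of Figure 2.5(b)) computes such an order with Kahn's topological sort; tasks released together could run concurrently:

    # Hedged sketch: order the dependent tasks of a DAG job for execution.
    from collections import deque

    def release_order(deps):
        # deps[t] = set of tasks that task t depends on
        indeg = {t: len(d) for t, d in deps.items()}
        children = {t: [] for t in deps}
        for t, d in deps.items():
            for p in d:
                children[p].append(t)
        ready = deque(t for t, k in indeg.items() if k == 0)
        order = []
        while ready:
            t = ready.popleft()
            order.append(t)
            for c in children[t]:
                indeg[c] -= 1
                if indeg[c] == 0:
                    ready.append(c)
        return order

    # Tasks 2..n of a job depending on task 1 (hypothetical layout):
    print(release_order({"task1": set(), "task2": {"task1"}, "taskN": {"task1"}}))
    # ['task1', 'task2', 'taskN']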

2.2.2.4 Resource Allocation Control

Resource allocation control is the mechanism that manages and controls resources in a cluster system. A resource allocation control system can be centralized or decentralized [86]. In a centralized system, jobs are administered centrally by a single resource manager that has complete knowledge of the system. In a decentralized RMS, several resource managers and providers communicate with one another to keep the load balanced across all resources and to satisfy the specific users' requirements [86]. Figure 2.6 depicts centralized and decentralized resource management systems (for more details, see [94]).

Figure 2.6: Resource Management: (a) Centralized Resource Management (a single resource manager controls all resources R1..RN) and (b) Decentralized Resource Management (multiple resource managers each control a subset of the resources)

2.2.2.5 Platform Support

The two main categories of cluster infrastructure that support the execution of cluster applications are homogeneous and heterogeneous platforms. In a homogeneous platform, the system runs on a number of computers with similar architectures and the same OS. In a


heterogeneous platform, the architectures and OSs of the nodes differ.

2.2.2.6 Evaluation Method

The performance of a cluster system can be evaluated through several metrics to determine the effectiveness of different cluster RMSs. The performance metrics are divided into two main categories, namely system-centric and user-centric evaluation criteria [95]. System-centric evaluation criteria depict the overall operational performance of the cluster. Alternatively, user-centric evaluation criteria portray the utility achieved by the participants. To assess the effectiveness of an RMS, both system-centric and user-centric evaluation factors are required. System-centric factors guarantee that system performance is not compromised, and user-centric factors assure that the desired utility of the RMS is achieved from the participants' perspective [95]. System-centric factors can include disk space, access interval, and computing power. User-centric factors can include the cost and execution time of the system. Moreover, a combination of the system-centric and user-centric approaches can be used to form another metric, drawing features from both, to evaluate the system more effectively [95, 96], as sketched below.
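One plausible (assumed, not prescribed by [95, 96]) way to combine the two criteria is a weighted score over normalized factors, for example:

    # Hedged sketch: blend system-centric and user-centric factors into one
    # score. Weights and normalization are illustrative assumptions; all
    # inputs are expected in [0, 1], and higher scores are better.
    def combined_score(utilization, throughput, user_cost, user_time,
                       w_sys=0.5, w_user=0.5):
        system_part = (utilization + throughput) / 2.0    # reward high values
        user_part = 1.0 - (user_cost + user_time) / 2.0   # reward low values
        return w_sys * system_part + w_user * user_part

    print(combined_score(utilization=0.8, throughput=0.7,
                         user_cost=0.3, user_time=0.2))    # 0.75

The weights let an evaluator bias the metric toward operator or participant interests without changing the underlying measurements.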

2.2.2.7 Process Migration

In a cluster, the transfer of a job from one computer to another without restarting it is known as process migration. A standard cluster RMS usually provides process migration in homogeneous systems. Migration in heterogeneous systems is much more complex because of the numerous conversion processes required between source and destination access points.

2.2.2.8 Correlation of Cluster Features and Resource Allocation

All the cluster computing features defined in the previous paragraphs are crucial for efficient resource management and allocation in the system. The job scheduling policy strictly depends on the type of job processed in a cluster. Moreover, job scheduling is completely different for batch (group) scheduling compared with sequentially and simultaneously processed applications. The job processing scheme, together with the job structure, determines the speed of the cluster system. A monolithic single-task job processed in sequential mode is the main cause of potentially ineffective system utilization, because some cluster nodes are kept idle for long periods.


Cluster RMSs are defined as system middleware that provides a single interface for user-level applications to be executed on the cluster. This allows the complexities of the underlying distributed nature of clusters to be hidden from the users. For effective management, the RMS in cluster computing requires some knowledge of how users value the cluster resources. Moreover, the RMS provides support for users to define QoS requirements for job execution. In such scenarios, system-centric approaches have a limited ability to achieve the user-desired utility. Therefore, the focus is to increase the system throughput and maximize resource utilization.

The administration of a centralized RMS is easier than that of decentralized structures because a single entity in the cluster has complete knowledge of the system. Moreover, the definition of communication protocols for different local job dispatchers is not required. However, the reliability of centralized systems may be low because the whole system goes down if the central cluster node fails. A distributed administration can tolerate the loss of any node detached from the cluster. Another important factor in resource allocation in cluster systems is platform support. In homogeneous systems, the resource types are related to the specified scheduling constraints and service requirements defined by the users. Analysis of the requests sent to the system can help in managing the resource allocation process. In a heterogeneous platform, the range of resources required may vary, which increases the complexity of resource management. A phenomenon in which a process, task, or request is permanently denied resources is known as resource starvation. The probability of resource starvation is lower in a heterogeneous platform than in a homogeneous one.

If any node is disconnected from the cluster, the workload of that node can be migrated to other nodes in the same cluster. Migration adds reliability and balances resource allocation across the cluster. A single node can request the migration of resources when a received request is difficult to handle. The cluster as a whole is responsible for process migration.

2.2.3 Grid Computer Systems: Features and Requirements

Grid systems are composed of resources that are distributed across various organizations and administrative domains [56]. A grid environment needs to dynamically address the issues involved in sharing a wide range of resources. Moreover, various types of grid systems, such as computational grids (CGs), desktop, enterprise, and data grids, can be designed [97, 56]. For


each type of grid, a set of features can be defined and analyzed. The author presents multiple grid properties that can be extracted from the different grid classes to form a generic model of a grid system. A generic model may have all or some of the following extracted properties.

2.2.3.1 System Type

The ultimately large scale of a grid system requires an appropriate architectural model that allows efficient management of geographically distributed resources over multiple administrative domains [83]. The system type can be categorized as computational, data, or service grid based on the focus of the grid. The computational grid can be further categorized into high-throughput and distributed computing. The service grid can be categorized as on-demand, collaborative, or multimedia. In hierarchical models, scheduling is hybrid, with centralized scheduling at the top level and decentralized scheduling at the lower level. Therefore, the author categorizes system types into three categories: (a) data, (b) computational, and (c) service.

2.2.3.2 Scheduling Organization

Scheduling organization refers to the mechanism that defines the way resources are allocated. Three main scheduling organizations are considered, namely: (a) centralized, (b) decentralized, and (c) hierarchical. In the centralized model, a central authority has full knowledge of the system. The disadvantages of the centralized model are limited scalability, lack of fault tolerance, and difficulty in accommodating multiple local policies imposed by the resource owners. In the decentralized model, local schedulers interact with each other to manage the task pool. No central authority is responsible for resource allocation in the decentralized model. Therefore, the model naturally addresses issues such as fault tolerance, scalability, site autonomy, and multi-policy scheduling. The decentralized model is used for large-scale networks, but the scheduling controllers need to coordinate with each other continuously for smooth scheduling. The coordination can be achieved through resource discovery or resource trading protocols. Finally, in the hierarchical model, a central meta-scheduler (or meta-broker) interacts with local job dispatchers to define the optimal schedules. The higher-level scheduler manages large sets of resources while the lower-level job managers control small sets of resources. The local schedulers have knowledge of their resource


clusters but cannot monitor the whole system. The advantage of using hierarchical scheduling is the ability to incorporate scalability and fault tolerance. Moreover, hierarchical scheduling retains some of the advantages of the centralized scheme, such as co-allocation (readers are encouraged to see [98] for more details). A minimal sketch of the hierarchical model follows.
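In the sketch below (class and site names are hypothetical, not from any surveyed system), the meta-scheduler only picks a site, while each local scheduler manages admission to its own resource pool:

    # Hedged sketch: a meta-scheduler delegating jobs to local schedulers.
    class LocalScheduler:
        def __init__(self, name, cores):
            self.name, self.free = name, cores
        def submit(self, job_cores):
            if job_cores <= self.free:       # local admission decision
                self.free -= job_cores
                return True
            return False

    class MetaScheduler:
        def __init__(self, sites):
            self.sites = sites
        def submit(self, job_cores):
            # Delegate to the local scheduler with the most free cores.
            for sched in sorted(self.sites, key=lambda s: -s.free):
                if sched.submit(job_cores):
                    return sched.name
            return None                      # no site can host the job

    meta = MetaScheduler([LocalScheduler("siteA", 8), LocalScheduler("siteB", 16)])
    print(meta.submit(10))   # siteB
    print(meta.submit(8))    # siteA (siteB now has only 6 cores free)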

2.2.3.3 Resource Description

Grid resources are spread across the wide-area global network with different local resource allocation policies. The resource characteristics should include the specific parameters needed to express resource heterogeneity, structure, and availability in the system. In the NWS project [85], the specification of CPU availability, TCP connection establishment time, end-to-end latency, and available bandwidth is needed for resource description. Similarly, Cactus worm (Section 2.2.2.11) is an on-demand grid computing system that needs an independent service responsible for resource discovery and selection based on application-supplied criteria, using GRRP and GRIP [99].

2.2.3.4 Resource Allocation Policies

A scheduling policy has to be defined for the ordering of jobs and requests whenever rescheduling is required. Different resource utilization policies are available for different systems owing to their different administrative domains. Figure 2.7 represents the taxonomy of resource allocation policies. Resource allocation policies are necessary for ordering the jobs and requests in all types of grid models. In the fixed resource allocation approach, the resource manager implements a predefined policy.

Moreover, the fixed resource allocation approach is further classified into two categories, namely system-oriented and application-oriented. A system-oriented allocation policy focuses on maximizing the throughput of the system [100]. The aim of an application-oriented allocation strategy is to optimize specific scheduling attributes, such as time and cost (storage capacity). Many examples of systems that use application-oriented resource allocation policies are available, such as PUNCH [101], WREN [102], and CONDOR [103]. The resource allocation strategy that allows external agents or entities to change the scheduling policy is called an extensible scheduling policy. The aforesaid can be implemented using ad-hoc extensible schemes that define an interface used by an agent for the modification of the scheduling policy [100].


Figure 2.7: Taxonomy of Resource Allocation Policies (resource allocation policy divides into fixed and extensible; fixed divides into system-oriented and application-oriented, and extensible into ad-hoc and structured)

2.2.3.5 Breadth of Scope

The breadth of scope expresses the scalability and self-adaptation levels of grid systems. If the system or grid-enabled application is designed only for a specific platform or application, then the breadth of scope is low. Systems that are highly scalable and self-adaptive can be characterized as having medium or high breadth of scope. Adopting self-adaptive mechanisms oriented towards a specific type of application can lead to poor performance for applications not covered by those mechanisms. One example concerning breadth of scope is a scheduler that treats applications as if there were no data dependencies between tasks; if an application does have task dependencies, the scheduler may perform poorly [66].

2.2.3.6 Triggering Information

Triggering information refers to an aggregator service that collects information and checks the data against a set of conditions defined in a configuration file [104]. If the conditions are met, then the specified action takes place. The service plays an important role in notifying the administrator or controller whenever any service fails. One example of such an action is constructing an email notification to the system administrator when the disk space on a server reaches a certain threshold; a sketch of this pattern is given below. Triggering information can be used by schedulers while allocating resources.
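A toy version of this trigger pattern (the rule format and action are assumptions; a real deployment would send mail rather than print) is sketched below:

    # Hedged sketch: evaluate collected metrics against configured triggers.
    TRIGGERS = [
        {"metric": "disk_used_pct", "threshold": 90.0,
         "action": lambda v: print(f"notify admin: disk at {v:.1f}%")},
    ]

    def evaluate(metrics):
        for rule in TRIGGERS:
            value = metrics.get(rule["metric"])
            if value is not None and value >= rule["threshold"]:
                rule["action"](value)   # e.g., e-mail the administrator

    evaluate({"disk_used_pct": 93.2})   # fires the notification
    evaluate({"disk_used_pct": 41.0})   # silent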

2.2.3.7 System Functionality

System functionality is an attribute used to define the core aspect of the system; for example, Javelin is a system for Internet-wide parallel computing based on Java.


2.2.3.8 Correlation of Grid Features and Resource Allocation

Grid systems are categorized into various types based on the characteristics of the resources, such as resource type and the allocation policies [105]. The primary focus of computational grids is on processing capabilities. Computational grids are suitable for the execution of compute-intensive and high-throughput applications that usually need more computing power than a single resource can provide [105]. The scheduling organization determines the priorities in the resource allocation process. In centralized systems, only single or multiple resources located in single or multiple domains can be managed [97]. In the decentralized scheduling model, the schedulers interact with each other to select resources appropriate for job execution. In case of conflicts among resource providers over a global policy for resource management, the aforesaid systems (centralized or decentralized) can be difficult to implement as a grid system. The hierarchical system allows remote resource providers to enforce local allocation policies [97]. Several policies are available for resource allocation. The fixed policy is generally used for sequentially processed jobs. Extensible policies are used if the application priorities can be set using external agents.

2.2.4 Cloud Computing Systems: Features and Requirements

Cloud computing systems are difficult to model in the presence of resource contention (competing access to shared resources). Many factors, such as the number of machines, the types of applications, and the overall workload characteristics, can vary widely and affect the performance of the system. A comprehensive study of the existing cloud technologies is presented in Section 2.3, based on the following set of generic features of cloud systems.

2.2.4.1 System Focus

Each cloud system is designed to focus on certain aspects; for example, Amazon EC2 is designed to provide the best infrastructure for cloud computing systems, with every possible feature available to the user [106]. Similarly, the GENI system [107] focuses on providing a virtual laboratory for exploring future internets in a cloud. Google Nimbus [108] focuses on extending and experimenting with a set of capabilities, such as resources as infrastructure and ease of use. OpenNebula [109] provides complete organization of data centers for on-premise IaaS cloud infrastructure.


2.2.4.2 Services

Cloud computing is usually considered the next step from the grid-utility model [56]. However, the cloud system not only realizes the services but also utilizes resource sharing. A cloud system guarantees the delivery of consistent services through advanced data centers that are built on compute and storage virtualization technologies [56, 110]. The type of services that a system provides to users is an important parameter in evaluating the system [111]. Cloud computing is all about providing services to the users, whether in the form of SaaS, PaaS, or IaaS. Cloud computing architectures can be categorized based on the type of services; for example, Amazon EC2 provides computational and storage services, whereas Sun Network.com (Sun Grid) provides only computational services. Similarly, we can categorize Microsoft Live Mesh and GRIDS Lab Aneka as infrastructure-based and software-based clouds, respectively.

2.2.4.3 Virtualization

Cloud resources are modeled as virtual computational nodes connected through a large-scale network conforming to a specified topology. A peer-to-peer ring topology is a commonly used example for organizing a cloud resource system and user community [110]. Based on virtualization, the cloud computing paradigm allows workloads to be deployed and scaled out quickly through the rapid provisioning of VMs on physical machines. The author evaluated a number of systems based on the entities or processes responsible for performing virtualization.

2.2.4.4 Dynamic QoS Negotiation

Real-time middleware services must guarantee predictable performance under specified load and failure conditions [80]. The provision of QoS attributes dynamically at run time based on specific conditions is termed dynamic QoS negotiation, an example being renegotiable variable bit rate [112]. Moreover, dynamic QoS negotiations ensure graceful degradation when the aforementioned conditions are violated. QoS requirements may vary during the execution of the system workflow to allow the best adaptation to customer expectations. Dynamic QoS negotiation in cloud systems is performed either by a dedicated process or by an entity. Moreover, self-managing algorithms [96] can also be implemented to perform dynamic QoS negotiation. In Eucalyptus [80], the dynamic QoS operations are performed by the resource services of the group managers. Dynamic QoS negotiation provides


more flexibility to the cloud, which can make a difference when choosing between two different clouds based on the requirements of the user. That is why dynamic QoS is used as a comparison attribute for clouds.

2.2.4.5 User Access Interface

The user access interface defines the communication protocol of the cloud system with the general user. Access interfaces must be equipped with the tools necessary for good performance of the system. Moreover, the access interface can be designed as a command-line, query-based, console-based, or graphical interface. Although access interfaces are also available in clusters and grids, in the case of the cloud the access interface is particularly important, because if the interface provided is not user-friendly, the user might not use the service. In another scenario, suppose two CSPs provide the same services, but one has a user-friendly interface and the other does not; users will generally prefer the one with the user-friendly interface. In such scenarios the access interface plays an important role, which is why it is used as a comparison feature for clouds.

2.2.4.6 Web APIs

In the cloud, a Web API is a web service dedicated to the combination of multiple web services into new applications [111]. A Web API consists of a set of HTTP request messages and descriptions of the schema of the response messages, in JSON or XML format [95]. Current Web APIs are built on REST-style communication, whereas earlier APIs were developed using SOAP-based services [111]. A sketch of the request/response pattern follows.
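As a hedged illustration of the REST/JSON pattern (the endpoint URL is hypothetical; only the request/response shape mirrors the description above):

    # Hedged sketch: an HTTP GET returning a JSON-described resource listing.
    import json
    import urllib.request

    def get_json(url):
        with urllib.request.urlopen(url) as resp:     # HTTP request message
            return json.loads(resp.read().decode())   # JSON response body

    # Example against a hypothetical cloud service:
    # instances = get_json("https://api.example-cloud.test/v1/instances")
    # print(instances["items"])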

2.2.4.7 Value Added Services

Value added services are defined as additional services beyond the standard services provided by the system. Value added services are available for a modest additional fee (or free of charge) as an attractive, low-cost form of system support. Moreover, the purposes of value added services are to: (a) promote the cloud system, (b) attract new service users, and (c) retain existing service users. The services mentioned in the SLA are standard services, while the services provided to end-users to promote the standard services fall under the category of value added services. Value added services are important for promoting the cloud and providing an edge over competitors. If one cloud


offers only SLA-based services and another offers SLA-based services plus value added services, then end-users will generally prefer the cloud that provides both.

2.2.4.8 Implementation Structure

Different programming languages and environments are used to implement cloud systems. The implementation package can be monolithic and consist of a single specific programming language. Google App Engine is an example of a cloud system that has been implemented in the Python scripting language. Another class is the high-level universal cloud systems, such as Sun Network.com (Sun Grid), that can be implemented using the Solaris OS and programming languages such as Java, C, C++, and FORTRAN. The implementation structure is an important aspect for comparing different clouds, because if a cloud is implemented in a language that is obsolete, users will hesitate to use that cloud.

2.2.4.9 VM Migration

VM technology has emerged as a building block of data centers, as it provides isolation, consolidation, and migration of workload. The purpose of migrating VMs is to seek improvements in performance, fault tolerance, and manageability of the systems over the cloud. Moreover, in large-scale systems, VM migration can also be used to balance load by migrating the workload from overloaded or overheated systems to underutilized systems. Some hypervisors, such as VMware [113] and Xen, provide "live" migration, where the OS continues to run while the migration is performed. VM migration is an important aspect of the cloud for achieving high performance and fault tolerance.

2.2.4.10 Pricing Model in Cloud

The pricing model implemented in the cloud is the pay-as-you-go model, where services are charged according to the QoS requirements of the users. Resources in the cloud, such as network bandwidth and storage, are charged at specific rates. For example, the standard price for block storage on the HP cloud is $0.10 per GB/month [114]. The prices of clouds may vary depending on the types of services they provide, as the worked sketch below illustrates.
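A worked sketch of pay-as-you-go charging, using the block-storage rate quoted above ($0.10 per GB/month [114]) and an assumed bandwidth rate for illustration:

    # Hedged sketch: monthly pay-as-you-go bill; the bandwidth rate is assumed.
    RATES = {"storage_gb_month": 0.10,    # quoted HP cloud rate [114]
             "bandwidth_gb": 0.05}        # illustrative assumption

    def monthly_charge(storage_gb, bandwidth_gb):
        return (storage_gb * RATES["storage_gb_month"]
                + bandwidth_gb * RATES["bandwidth_gb"])

    # 200 GB stored plus 50 GB transferred: 200*0.10 + 50*0.05 = $22.50
    print(f"${monthly_charge(200, 50):.2f}")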


2.2.4.11 Correlation of Cloud Features and Resource Allocation

The focus of a cloud system is an important factor in the selection of appropriate resources and services for cloud users. Some resources may require a specific type of infrastructure or platform. However, cloud computing is more service-oriented than resource-oriented [62]. Cloud users do not care much about the resources, but are more concerned with the services being provided. Virtualization is used to hide the complexity of the underlying system and resources. User satisfaction is one of the main concerns in provisioning cloud computing web services. Dynamic QoS negotiations can only be made if the resources are available.

2.3 Comparison and Survey of the Existing HPC Solutions

In Table 2.1, a comparison of the three HPC categories (cluster, grid, and cloud) was provided. The author classifies various HPC research projects and commercial products according to the HPC systems classification developed in Section 2.2. The list of systems discussed is not exhaustive but is representative of the classes. The projects in each category have been chosen in Table 2.2 based on the factors specified for each HPC class reported in Section 2.2.

Table 2.2: Survey of the Existing HPC Systems

Cluster: Enhanced MOSIX, Gluster, Faucets, DQS, Tycoon, Cluster-on-Demand, Kerrighed, OpenSSI, Libra, PVM, CONDOR, REXEC, GNQS, LoadLeveler, LSF, SLURM, PBS

Grid: GRACE, Ninf, G-QoSM, Javelin, NWS, GHS, Stanford Peer Initiatives, 2K, AppleS, Darwin, Cactus, PUNCH, Nimrod/G, NetSolve, MOL, Legion, WREN, Globus

Cloud: Amazon EC2, Eucalyptus, Google App Engine, GENI, Microsoft Live Mesh, Sun Network.com (Sun Grid), E-Learning Ecosystem, GRIDS Lab Aneka, OpenStack


2.3.1 Cluster Computing Systems

A list of representative cluster projects with a brief summary is provided in Table 2.3. The systems discussed below are characterized based on the generic cluster system features highlighted in Section 2.2.

2.3.1.1 Enhanced MOSIX

Enhanced MOSIX (E-MOSIX) is a tailored version of the MOSIX project [115], which was geared towards achieving efficient resource utilization among the nodes of a distributed environment. Users create multiple processes to run their applications. MOSIX then discovers the resources and automatically migrates the processes among the nodes for performance improvement, without changing the run-time environment of the processes. E-MOSIX uses a cost-based policy for process migration. Each node in the cluster makes resource allocation decisions independently. Information about different resources is collected, and the overall system performance measure is defined as the total cost of resource utilization. E-MOSIX supports the parallel job processing mode. Moreover, process migration is used to decrease the overall cost of job execution on the different machines in the cluster. Furthermore, decentralized resource control is implemented, and each cluster node in the system is supplied with an autonomous resource assignment policy.

2.3.1.2 Gluster

Gluster defines a uniform computing and storage platform for developing applications inclined towards specific tasks, such as storage, database clustering, and enterprise provisioning [116]. Gluster is distribution-independent and has been tested on a number of distributions. Gluster is an open-source and scalable platform whose distributed file system, GlusterFS, is capable of scaling to thousands of clients. Commodity servers are combined with Gluster and storage to form massive storage networks. Gluster SP and Gluster HPC are bundled cluster applications associated with Gluster. The said system can be extended using Python scripts [116].


Table 2.3: Comparison of Cluster Computing Systems

System (Year) | Job Processing Type | QoS Attributes | Job Composition | Resource Allocation Control | Platform Support | Evaluation Method | Process Migration
Enhanced MOSIX [115] (1999) | Parallel | Cost | Single task | Decentralized | Heterogeneous | User-centric | Yes
Gluster [116] (2007) | Parallel | Reliability (no single point of failure) | Parallel task | Decentralized | Heterogeneous | N/A | Yes
Faucets [117] (2003) | Parallel | Time, cost | Parallel task | Centralized | Heterogeneous | System-centric | Yes
DQS [120] (1998) | Batch | CPU/memory sizes, hardware architecture, OS versions | Parallel task | Decentralized | Heterogeneous | System-centric | No
Tycoon [121] (2004) | Sequential | Time, cost | Multiple task | Decentralized | Heterogeneous | User-centric | No
Cluster-on-Demand [123] (2002) | Sequential | Cost in terms of time | Independent | Decentralized | Heterogeneous | User-centric | No
Kerrighed [124, 125] (1999) | Sequential | Ease of use, high performance, high availability, efficient resource management, high OS customizability | Multiple task | Decentralized | Homogeneous | System-centric | Yes
OpenSSI [126] (2004) | Parallel | Availability, scalability, manageability | Multiple task | Centralized | Heterogeneous | System-centric | Yes
Libra [82] (2004) | Batch, sequential | Time, cost | Parallel | Centralized | Heterogeneous | System- and user-centric | Yes
PVM [128] (2001) | Parallel, concurrent | Cost | Multiple task | Centralized | Heterogeneous | User-centric | Yes
Condor [103, 134] (1998) | Parallel | Throughput, productivity of computing environment | Multiple task | Centralized | Heterogeneous | System-centric | Yes
REXEC [129] (1999) | Parallel, sequential | Cost | Independent, single task | Decentralized | Homogeneous | User-centric | No
GNQS [130] (2001) | Batch, parallel | Computing power | Parallel processing | Centralized | Heterogeneous | System-centric | No
LoadLeveler [131] (2001) | Parallel | Time, high availability | Multiple task | Centralized | Heterogeneous | System-centric | Yes
LSF [132] (2007) | Parallel, batch | Job submission simplification, reduced setup time and operation errors | Multiple task | Centralized | Heterogeneous | System-centric | Yes
SLURM [90] (2003) | Parallel | Simplicity, scalability, portability, fault tolerance | Multiple task | Centralized | Homogeneous | System- and user-centric | No
PBS [133] (1991) | Batch | Time, job queuing | Multiple task | Centralized | Heterogeneous | System-centric | Yes


2.3.1.3 Faucets

Faucets [117] is designed for processing parallel applications and offers an internal adaptation framework for parallel applications based on Adaptive MPI [118] and Charm++ [69] solutions. The number of applications executed on Faucets can vary [119]. This allows the utilization of all resources that are currently available in the system. For each parallel task submitted to the system, the user has to specify the required software environment, the expected completion time, the number of processors needed for task completion, and budget limits. The privileged scheduling criterion in Faucets is the completion time of a job. The total cost of resource utilization calculated for a particular user is specified based on the bids received from the resource providers. Faucets supports time-shared scheduling that simultaneously executes adaptive jobs based on dissimilar percentages of allocated processors. Faucets supports the parallel job processing type. Moreover, the constraints on the requirements of any parallel task remain constant throughout the task's execution. Jobs are submitted to Faucets with a QoS requirement, subscribing clusters return bids, and the best bid that meets all criteria is selected. Bartering is an important element of Faucets that permits cluster maintainers to exchange computational power with each other. Units are awarded when the bidding cluster successfully runs an application. Users can later trade the bartering units to use resources on other subscribing clusters.

2.3.1.4 Distributed Queuing System

DQS is used for scheduling background tasks on a number of workstations. The tasks are presented to the system as a queue of applications. The queue of tasks is automatically organized by the DQS system based on the current resource status [120]. Jobs in the queue are sorted by the priority of the subsequent submission pattern, an internal sub-priority, and the job identifier; a sketch of this ordering is given below. The sub-priority is calculated each time the master node scans the queued jobs for scheduling. The calculations are relative to each user and reflect the total number of jobs in the queue that are ahead of each job. The total number of jobs includes any of the user's jobs that are in the "Running" or "Queued" state.
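The ordering described above can be sketched as a composite sort key (the field values are invented, and the direction of the sub-priority comparison is an assumption):

    # Hedged sketch of DQS-style queue ordering: priority first, then the
    # per-user sub-priority, then the job identifier as a tie-breaker.
    jobs = [
        {"id": 3, "priority": 5, "sub_priority": 2},
        {"id": 1, "priority": 5, "sub_priority": 1},
        {"id": 2, "priority": 9, "sub_priority": 4},
    ]
    jobs.sort(key=lambda j: (-j["priority"], j["sub_priority"], j["id"]))
    print([j["id"] for j in jobs])   # [2, 1, 3]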

2.3.1.5 Tycoon

Tycoon allocates cluster resources along different system performance dimensions, such as CPU cycles, memory, and bandwidth [121, 122]. Tycoon is based on the principle of


proportional resource sharing. A major advantage of Tycoon is that it differentiates the values of the jobs. Communication delay is a major factor in resource acquisition latency, and no manual bidding process is available in Tycoon. Manual bidding supports the proficient use of different resources when no precise bids are present at all. Tycoon is composed of four main components, namely: (a) bank, (b) auctioneers, (c) location service, and (d) agents. Tycoon [122] uses a two-tier architecture for the allocation of resources. Ref. [122] differentiates between the allocation mechanism and the user strategy. The allocation mechanism offers different means of seeking user assessments for efficient execution, and the user strategy captures high-level preferences that vary across users and are more application-dependent. The separation of allocation mechanism and user strategy keeps the requirements from being restricted and interdependent.
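The proportional-share principle underlying Tycoon can be sketched in a few lines (the bids and capacity are illustrative; real Tycoon auctions are considerably richer):

    # Hedged sketch: each bidder receives a share proportional to its bid.
    def proportional_share(bids, capacity):
        total = sum(bids.values())
        return {user: capacity * bid / total for user, bid in bids.items()}

    print(proportional_share({"alice": 6.0, "bob": 2.0, "carol": 2.0},
                             capacity=100))
    # alice gets 60 units; bob and carol get 20 each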

2.3.1.6 Cluster on Demand

Cluster-on-Demand allocates servers from a common pool to multiple partitions called

virtual clusters, with independently configured software environments [123]. The jobs

executed in the system are implicitly single-task applications and are ordered on the basis

of arrival time. For each job submitted to the system, the user specifies a value function

containing a constant reduction factor for the required level of services needed by the

user [123]. The value function remains static throughout the execution once the

agreement has been approved by the user. A cluster manager is responsible for

scheduling the tasks to resources from different administrative domains. The support of

adaptive allocation of resource updates is available. Moreover, the cluster manager forces less costly dedicated jobs to wait for more costly new tasks that may arrive later. From the aforementioned, it can be deduced that no hard constraints are supported, because many accepted jobs can take more time to complete than anticipated. The cost measure of Cluster-on-Demand is the cost of node configuration for

a full wipe clean install. The Cluster-on-Demand uses a user-centric evaluation of cost

measure and the major cost factor is the type of hardware devices used.

2.3.1.7 Kerrighed

Kerrighed is a cluster system with a Linux kernel patch as a main module for controlling

the whole system behavior [124, 125]. For fair load balancing of the cluster, schedulers

use sockets, pipe, and char devices. Moreover, the use of devices does not affect the


cluster communication mechanisms due to seamless migration of the applications across

the system. Furthermore, the migration of single-threaded and multi-threaded applications is supported in the process. A running process at one node can be paused and restarted at another node. The Kerrighed system provides the view of a single SMP machine.

2.3.1.8 Open Single System Image

OpenSSI is an open source single-system-image clustering system; it allows a collection of computers to serve as one joint large cluster [126]. Contrary to Kerrighed, in OpenSSI the number of available resources may vary during task execution. OpenSSI is based on the Linux OS. OpenSSI uses the concept of bit-variation process migration derived from Mosix; bit variation dynamically balances the CPU load on the cluster by migrating threaded processes. Process management in OpenSSI is robust: a single process ID is assigned to each process on a cluster, and inter-process communication is handled cluster-wide. The limitation of OpenSSI is that it supports a maximum of 125 nodes per cluster.

2.3.1.9 Libra

Libra admits jobs based on the system and user requirements [82]. Different resources are allocated based on the budget and deadline constraints of each job. Libra communicates with the federated resource manager that is responsible for collecting information about the different resources present in the cluster. In case of a mixed

composition of resources, estimated execution time is calculated on diverse worker

nodes. Libra assigns different resources to executing jobs based on the deadlines. A

centralized accounting mechanism is used for resource utilization of current jobs, to

periodically relocate time partitions for each critical job and to meet the deadlines. Libra

assumes that each submitted job is sequential and is composed of a single task. Libra

schedules tasks to internal working nodes available in the cluster [127]. Each internal

node has a task control component that relocates and reassigns processor time, and

performs partitioning periodically based on the actual execution and deadline of each

active job. The system evaluation factors to assess overall system performance are

system-centric with average waiting and response time as the parameters. However, Libra

performs better than the traditional FCFS scheduling approach for both user-centric and

system-centric evaluation factors.


2.3.1.10 Parallel Virtual Machine

PVM is a portable software package combining a heterogeneous collection of computers

in a network to provide a view of a single large parallel computer. The aim of using PVM

is to aggregate memory and power of many computers to solve large computational

problems in a cost-efficient way. To solve much larger problems, PVM accommodates existing computer hardware at some minimal extra cost. A PVM user outside the cluster can view the cluster as a single terminal. All cluster details are hidden from the end user, irrespective of how the cluster places tasks on individual nodes. PVM is currently

being used at a number of sites across the globe for solving medical, scientific, and

industrial problems [128]. PVM is also employed as an educational tool for teaching

parallel programming courses.

2.3.1.11 REXEC

In REXEC, a resource sharing mechanism, the users compete for shared resources in a cluster [129]. The computational loads of the resources are balanced according to the total allocation cost, such as the credits per minute that users agreed to pay for resource utilization. Multiple daemons, which are the key components of the decentralized resource management control system, select the best node to execute particular tasks. Jobs are mapped to the distributed resources at the same time intervals according to time-shared scheduling rules. REXEC supports parallel and sequential job processing. Users specify cost restrictions that remain fixed after task submission. The resource assignments already present in the system are reassigned whenever a new task execution is initialized or finished. REXEC uses an aggregate utility function

as a user-centric evaluation factor that represents the cost of all tasks on the cluster. The

end-users are charged based on the completion times of the tasks.
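As a toy illustration of the proportional, cost-driven placement described above (the node structure, field names, and selection rule are our assumptions, not the REXEC implementation):

```python
def best_node(nodes):
    """Pick the node with the lowest aggregate credit rate; under
    proportional sharing, a new task gets the largest share there."""
    return min(nodes, key=lambda n: sum(n["tasks"].values()))

def cpu_share(node, task):
    """A task's CPU share on a node is its credit rate divided by the
    total credit rate of all tasks placed on that node."""
    return node["tasks"][task] / sum(node["tasks"].values())

# Credit rates (credits per minute) that users agreed to pay, per task.
node_a = {"name": "a", "tasks": {"t1": 3.0}}
node_b = {"name": "b", "tasks": {"t2": 1.0, "t3": 1.0}}

target = best_node([node_a, node_b])  # node "b": total rate 2.0 < 3.0
target["tasks"]["t_new"] = 2.0        # new task paying 2 credits/minute
print(target["name"], cpu_share(target, "t_new"))  # prints: b 0.5
```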

2.3.1.12 Generic Network Queuing System

GNQS is an open source batch processing system. GNQS schedules work across networks of computers or applications on a single machine, and does not allow tasks to be executed simultaneously [130]. GNQS is not a shareware application and is maintained by a large community across the Internet. An ANSI-C compiler and root privileges are required to compile the code and successfully run GNQS on a single local computer.


2.3.1.13 Load Leveler

Load Leveler [131] is a parallel scheduling system developed by IBM that works by

matching the processing needs of each task and the priorities of available resources. Many end users can execute jobs within a limited time interval by using Load Leveler. For high

availability and workload management, the Load Leveler provides a single point of

control. In a multi-user production environment, the use of Load Leveler supports

aggregate improvement in system performance as well as turnaround time with equal

distribution of the resources. In Load Leveler, every machine that contributes must run

one or more daemons [131].

2.3.1.14 Load Sharing Facility

LSF [132] has a complete set of workload management abilities that manages workload

in distributed, demanding, and critical HPC environments. LSF executes batch jobs. The

set of workload management and intelligent scheduling features fully utilizes the

computing resources. LSF schedules a complex workload and provides a highly available

and scalable architecture. Ref. [132] provides HPC components, such as an HPC data center for managing workload, and also provides vendor support.

2.3.1.15 Simple Linux Utility for Resource Management

SLURM [90] is an open source, scalable, and fault-tolerant cluster management and job

scheduling system. SLURM is used in small and large Linux cluster environments.

SLURM provides users with exclusive and non-exclusive access to computing resources, executes and monitors work on the allocated resources, and finally arbitrates the waiting requests through a queue of pending work.

2.3.1.16 Portable Batch System

PBS [133] provides job resource management in a batch cluster environment. In an HPC environment, PBS provides job information to Moab, a job scheduler used with PBS. Moab decides which jobs are selected for execution. PBS selects and dispatches jobs

from the queue to cluster nodes for execution. PBS supports non-interactive batch jobs

and interactive batch jobs. The non-interactive jobs are more common. The essential

execution commands and resource requests are created in the form of a job script that is

submitted for execution.


2.3.1.17 Condor (HTCondor)

A large collection of heterogeneous machines and networks is managed by the Condor high-throughput computing environment [103, 134]. Condor shares and combines the idle

computing resources. Condor preserves the originating machine's specifications through remote system call capabilities. The remote system calls track the originating machine when the file system or scheme is not shared among the users. A Condor matchmaker is used to determine compatible resource requests. The

matchmaker triggers a query to the Condor collector for the resource information stored for

resource discovery.

2.3.2 Grid Computing Systems

A diverse range of applications is employed in computational grids. Scientists and

engineers rely on grid computing to solve challenging problems in engineering,

manufacturing, finance, risk analysis, data processing, and science [135]. Table 2.4

shows the representative grid systems that are analyzed according to the grid features

specified in Section 2.1. All the values of the features are straightforward. For some features, such as breadth of scope or triggering information, we performed a comparative study and rated the values as high, medium, or low; no threshold value for the categorization is provided.


Table 2.4: Comparison of Grid Computing Systems

System | Sys Type | Scheduling Organization | Resource Description | Resource Allocation Policy | Breadth of Scope | Triggering Info | Sys Functionality
GRACE [135,136] (2005) | Computational | Not specified; can be decentralized/ad-hoc | CPU processing power, memory, storage capacity, and network bandwidth | Fixed AOP | High | High | Resources are allocated on demand and supply
Ninf [137] (1996) | Computational | Decentralized | No QoS, periodic push dissemination, centralized queries discovery | Fixed AOP | Medium | Low | A global computing client-server based system
G-QoSM [139,140,141] (2002) | On-demand | Requirements matching | Processing power and bandwidth | Fixed AOP | High | High | SLA-based resources are allocated in the system
Javelin [142] (2000) | Computational | Decentralized | Soft QoS, distributed queries discovery, other network directory store, periodic push dissemination | Fixed AOP | Low | Medium | A system for Internet-wide parallel computing based on Java
NWS [143] (1997) | Hierarchical | Host capacity | End-to-end latency and available bandwidth, availability of CPU | N/A | Low | Low | Used for short-term prediction
GHS [147] (2005) | Hierarchical | Heuristics based on host capacity | Availability of CPU, TCP connection establishment time, end-to-end latency | N/A | Medium | Medium | Scalability and precision in prediction at a higher level than NWS [47]
Stanford Peers Initiative | Computational | Hierarchical/Decentralized | CPU cycles, disk space, network bandwidth | N/A | High | Medium | Distribution of the main costs of sharing data: disk space for storing files and bandwidth for transfer
2K [149,150] (1999) | On-demand | Hierarchical/Decentralized | Online dissemination, soft network QoS, agent discovery | Fixed AOP | High | Medium | Flexible and adaptable distributed OS used for a wide variety of platforms
AppLeS [153,138] (1997) | High-throughput | Hierarchical/Decentralized | Models for resources provided by Globus, Legion, or NetSolve | Fixed AOP | Low | Medium | Produces scheduling agents for computational grids
Darwin [154] (2001) | Multimedia | Hierarchical/Decentralized | Hard QoS, graph namespace | Fixed system-oriented policy (SOP) | Low | High | Manages resources for network services
Cactus Worm [155] (2001) | On-demand | Requirements matching | N/A | Fixed AOP | High | Medium | When the required performance is not achieved, the system allows applications to adapt accordingly
PUNCH [101] (1999) | Computational | Hierarchical/Decentralized | Soft QoS, periodic push dissemination, distributed queries discovery | Fixed AOP | Medium | Medium | A middleware that provides transparent access to remote programs and resources
Nimrod/G [158,159] (2000) | High-throughput | Hierarchical/Decentralized | Relational network directory data store, soft QoS, distributed queries discovery | Fixed AOP | Medium | Medium | Provides brokering services for task farming applications
NetSolve [162] (1997) | Computational | Decentralized | Soft QoS, periodic push dissemination, distributed queries discovery | Fixed AOP | Medium | Medium | A network-enabled application server for solving computational problems in distributed environments
MOL [164] (2000) | Computational | Decentralized | Distributed queries discovery, periodic push dissemination | Extensible ad-hoc scheduling policies (ASP) | Low | Low | Provides resource management for dynamic communication, fault management, and access provision
Legion [100,166] (1999) | Computational | Hierarchical/Decentralized | Soft QoS, periodic pull dissemination, distributed queries discovery | Extensible structured scheduling policy (SSP) | Medium | Medium | Provides an infrastructure for grids based on an object meta-system
Wren [102] (2003) | Grid | No mechanism for initial scheduling | N/A | Fixed AOP | Low | Low | Provides active probing with low overhead
Globus [161] (1996) | Hierarchical | Decentralized | Soft QoS, network directory store, distributed queries discovery | Extensible ad-hoc scheduling policy (ASP) | High | Medium | Provides basic services for modular deployment of grids in the Globus Metacomputing Toolkit


2.3.2.1 Grid Architecture for Computational Economy

GRACE is a generic infrastructure for the market-based grid approach that co-exists with

other grid systems, such as Globus. The interactions of the grid users with the system are

provided through the GRB. GRACE employs the Nimrod/G grid scheduler [135], responsible for:

(a) resource discovery, (b) selection, (c) scheduling, and (d) allocation. The resource

brokers fulfill the user demands by optimizing execution time of the jobs and user budget

expenses, simultaneously [135]. GRACE architecture allocates resources on supply and

demand basis [136]. The resources monitored by GRACE are software applications or

hardware devices. GRACE enables the control of CPU power, memory, storage capacity,

and network bandwidth. The resources are allocated according to the fixed application-

oriented policy.

2.3.2.2 Network Infrastructure

Ninf [137] is an example of a computational grid that is based on a client-server infrastructure. Ninf clients are connected with the servers through local area networks. The server and client machines can be heterogeneous, which is why the data to be communicated is translated into a common network data format [137]. The components of the Ninf system are client interfaces, remote libraries, and a meta-server. Ninf applications invoke Ninf libraries, and the request is forwarded to the Ninf meta-server that maintains the Ninf servers' directory. The Ninf meta-server forwards the library request to the appropriate server. Moreover, Ninf uses a centralized resource discovery

mechanism. The computational resources are registered with the meta-server through library

services [138]. The scheduling mechanism in Ninf is decentralized and the server

performs actual scheduling of the client requests. Ninf uses a fixed application oriented

policy, has a medium level breadth of scope, and provides no QoS. The triggering

information in Ninf is low.

2.3.2.3 Grid-Quality of Services Management

The G-QoSM system [136, 139, 140] works under the OGSA [141]. G-QoSM provides resource and service discovery support based on QoS features. Moreover, a guarantee of supporting QoS at the application, network, and grid middleware levels is also provided. G-QoSM provides three levels of QoS, namely: (a) best effort, (b) controlled, and (c) guaranteed

levels. The resources are allocated on the basis of SLA between the users and providers.


G-QoSM utilizes SLA mechanism, so the triggering information as well as breadth of

scope is high. The scheduling organization in G-QoSM can be centralized or

decentralized. However, the main focus of G-QoSM is managing the QoS. The resources managed by G-QoSM are bandwidth and processing power. G-QoSM uses a fixed application-oriented policy for resource allocation.

2.3.2.4 Javelin

Javelin is a Java based infrastructure [142] that may be used as an Internet-wide parallel

computing system. Javelin is composed of three main components: (a) clients, (b) hosts,

and (c) brokers. Hosts provide computational resources; clients seek for computational

resources, and resource brokers support the allocation of the resources. In Javelin, hosts can be attached to a broker and are considered as resources. Javelin uses hierarchical resource management [138]. If a client or host wants to connect to Javelin, then a connection has to be made with a Javelin broker that agrees to support the client or host. The backbone of Javelin is the BNS, an information system that keeps the information about available brokers [142]. Javelin has a decentralized scheduling organization with a fixed application-oriented resource allocation policy. The breadth of scope of Javelin is low, as it only supports Java-based applications. The triggering information of Javelin is

medium.

2.3.2.5 Network Weather Service

NWS [143] is a distributed prediction system for the network (and resources) dynamics.

The prediction mechanism in NWS is based on the adaptation strategies that analyze the

previous system states. In NWS, system features and network performance factors, such

as bandwidth, CPU speed, TCP connection establishment time, and latency are

considered as the main criteria of resource description measurements [136]. Systems such as NWS have been used successfully to choose between replicated web pages [144] and to implement dynamic scheduling agents for meta-computing applications [145, 146]. An extensible system architecture, distributed fault-tolerant control algorithms, and adaptive programming techniques enable the NWS implementation to operate in a variety of meta-computing and distributed environments with changing performance characteristics. NWS uses a host-capacity based scheduling organization.


Moreover, NWS works well for short-term processes in restricted-area grid clusters. Therefore, the breadth of scope and triggering information of NWS are low.

2.3.2.6 Grid Harvest Service

The goal of GHS is to achieve high scalability and precision in a network [147]. The GHS system is comprised of five subsystems, namely: (a) task allocation module, (b) execution management system, (c) performance measurement, (d) performance evaluation, and (e) task scheduling modules. GHS enhances application performance through task re-scheduling and utilizes two scheduling algorithms: the first algorithm minimizes task execution time, and the second assigns the tasks to an individual resource. GHS, like NWS [143], uses host-capacity based heuristics as its scheduling organization. The breadth of scope and the triggering information are of medium level in GHS [136].

2.3.2.7 Stanford Peers Initiative

Stanford Peers Initiative utilizes a peer-to-peer data trading framework to create a digital

archiving system. The Stanford Peers Initiative uses a unique bid-trading auction method that seeks bids from distant web services to replicate the collection. In response, each remote web service replies with a bid that reflects the amount of total disk storage space requested [148]. The local web service selects the lowest bid to maximize the benefits. Because the system focuses on preserving the data for the longest possible period, the major system performance factor is reliability. The reliability measure is the MTTF for each local web service. Each web service tries to minimize the total cost of trading, which is usually measured in terms of disk space provided. For replicating a data collection, a decentralized

management control is implemented in the system. The web service makes the decision to

select the suitable remote services. Each web service is represented as an independent

system entity. The storage space remains fixed throughout, even if a remote service is

selected.
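A hedged sketch of the bid-trading step described above (the site names, units, and dictionary layout are ours, purely for illustration):

```python
def choose_replica_host(bids):
    """Each remote web service bids the disk space it wants in return for
    hosting the replica; the local service takes the lowest bid to keep
    its trading cost (space given away) minimal."""
    return min(bids, key=bids.get)

# Bids: GB of local storage requested in exchange for replicating the data.
bids = {"siteA": 120, "siteB": 95, "siteC": 150}
print(choose_replica_host(bids))  # prints: siteB
```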

2.3.2.8 2K

The 2K grid system provides distributed services for multiple flexible and adaptable platforms [149, 150]. The supported platforms range from PDAs to large-scale computers for the

application processing. 2K is a reflective OS built on top of a reflective ORB, dynamic

TAO [151], a dynamically configurable version of TAO [152]. The key features of 2K


are: (a) distribution, (b) user-centrism, (c) adaptation, and (d) architectural awareness. 2K

is an example of an on-demand grid system that uses agents for resource discovery and

mobile agents to perform resource dissemination functionality [138]. 2K uses a decentralized and hierarchical scheduling organization and a fixed application-oriented resource allocation policy. In the 2K system, no mechanism for rescheduling is supported.

The breadth of scope of the 2K system is high due to multiple ranges of platforms.

Moreover, the triggering information is medium due to soft QoS provisioning.

2.3.2.9 AppLeS

AppLeS is an application level grid scheduler that operates as an agent in a dynamic

environment. AppLeS assists an application developer by enhancing the scheduling

activity. For each application, an individual AppLeS agent is designed for resource

selection. AppLeS agents are not utilized as a resource management system; rather, projects such as the Legion or Globus grid packages [153] provide the resource management systems. AppLeS is used for computational purposes. AppLeS provides templates that are used in structurally similar applications. AppLeS utilizes hierarchical or decentralized schedulers and a fixed application-oriented policy for resource allocation [138]. The

triggering information in AppLeS is medium and the breadth of scope is low.

2.3.2.10 Darwin

Darwin is a grid resource management system that provides value-added network

services electronically [154]. The main features of the system are: (a) high level resource

selection, (b) run-time resource management, (c) hierarchical scheduling, and (d) low

level resource allocation mechanisms [154]. Darwin utilizes hierarchical schedulers and

online rescheduling mechanisms. The resource allocation policy in Darwin is fixed

system-oriented. To allocate resources globally in grid, the system employs a request

broker called Xena [154]. For resource allocation at higher level, Darwin uses an H-FSC

scheduling algorithm. Darwin runs in routers and provides hard network QoS.

2.3.2.11 Cactus Worm

Cactus Worm [139, 155] is an on-demand grid computing system. Cactus Worm supports

an adaptive application structure and can be characterized as an experimental framework

that can handle dynamic resource features. Cactus supports dynamic resource selection

for resource interchange through migration. The migration mechanism is performed only


when the performance is not adequate [155]. Cactus Worm supports different

architectures, namely: (a) uni-processors, (b) clusters, and (c) supercomputers [156]. The scheduling organization in Cactus Worm is based on requirements matching (Condor), and a fixed AOP is used for resource allocation. The functionality of the Cactus Worm is expressed in the adaptation of the resource allocation policy if the required

performance level is not achieved. Moreover, the breadth of scope is high and triggering

information is at the middle level in Cactus Worm.

2.3.2.12 PUNCH

PUNCH is a network-based computing middleware test-bed that provides OS services in

a distributed computing environment [101]. PUNCH is a multi-user and multi-process

environment that allows: (a) a transparent remote access to applications and resources,

(b) access control, and (c) job control functionality. PUNCH supports a virtual grid

organization by fully decentralized and autonomous management of resources [157]. The

key concept is to design and implement a platform that provides independence between

the applications and the computing infrastructure. PUNCH possesses hierarchical

decentralized resource management and predictive machine learning methodologies for

mapping the jobs to resources. PUNCH uses: (a) an extensible schema model, (b) a hybrid namespace, (c) soft QoS, (d) distributed queries discovery, and (e) periodic push dissemination as its resource description. The resources are allocated according to the fixed application

oriented policy. The functionality of PUNCH systems is expressed in terms of flexible

remote access to the user and the computing infrastructure of the application. The

breadth of scope and triggering information of PUNCH are at the middle level.

2.3.2.13 Nimrod/G

Nimrod/G is designed to seamlessly execute large-scale parameter study simulations,

such as parameter sweep applications through a simple declarative language and GUI on

computational Grids [158, 159]. Nimrod/G is a grid resource broker for managing and

steering task farming applications and follows a computational market-based model for resource management. Nimrod/G strives for low-cost access to computational resources using GRACE services. Moreover, the cost under user-defined constraints is minimized using adaptive scheduling algorithms. Besides parameter studies, Nimrod/G also provides

support for a single window to: (a) manage and control experiments, (b) discover


resources, (c) trade resources, and (d) perform scheduling [160, 161]. Nimrod/G uses a

task farming engine that generates user-defined scheduling policies. Nimrod/G uses

hierarchical decentralized scheduler and predictive pricing models as scheduling

organizations. Moreover, Nimrod/G uses resource descriptions, such as (a) relational

network directory data store, (b) soft QoS, (c) distributed queries discovery, and (d)

periodic dissemination. A fixed application-oriented policy driven by user-defined requirements, such as deadline and budget limitations, is used in Nimrod/G for resource allocation. ActiveSheets is an example of a Nimrod/G application used to execute Microsoft Excel computations/cells on the Grid [162].

2.3.2.14 NetSolve

NetSolve [163] is an application server based on a client-agent-server environment.

NetSolve integrates distributed resources to a desktop application. NetSolve resources

include hardware, software, and computational software packages. TCP/IP sockets are

used for the interaction among the user, agents, and servers. The server can be

implemented in any scientific package. Moreover, the clients can be implemented in C,

FORTRAN, MATLAB, or web pages. The agents are responsible for locating the best

possible resources available in the network. Once the resource is selected, the agents

execute the client request and return the answers back to the user. NetSolve is a

computational grid with a decentralized scheduler. NetSolve uses soft QoS, distributed

queries discovery and periodic push dissemination for the resource description.

Moreover, a fixed application-oriented policy is used for resource allocation. The breadth of scope and triggering information of NetSolve are medium, as scalability is limited to certain applications.

2.3.2.15 Meta Computing Online

MOL system consists of a kernel as the core component of system and provides the basic

infrastructure for interconnected resources, users, and third party meta-computer

components [164]. MOL supports dynamic communications, fault management, and

access provisions. The key aspects of MOL kernel are reliability and flexibility.

Moreover, MOL is the first meta-computer infrastructure that does not exhibit a single point of failure [165]. MOL has a decentralized scheduler and uses: (a) a hierarchical namespace, (b) an object model store, and (c) distributed queries discovery as resource


description. MOL uses extensible ad-hoc scheduling policies for resource allocation. No QoS support is available, so the triggering information and breadth of scope are low.

2.3.2.16 Legion

Legion is a software infrastructure that aims to connect multiple hosts ranging from PCs

to massive parallel computers [100]. The most important features that motivate the use of

legion include: (a) site autonomy, (b) support for heterogeneity, (c) usability, (d) parallel

processing to achieve high system performance, (e) extensibility, (f) fault tolerance, (g)

scalability, (h) security, (i) multi-language implementation support, and (j) global naming

[166]. Legion appears as a vertical system and follows a hierarchical scheduling model.

Legion uses distributed queries for resource discovery and periodic pull for

dissemination. Moreover, Legion uses extensible structured scheduling policy for

resource allocation. One of the main objectives of the Legion is the system scalability

and high performance. The breadth of scope and triggering information are high in

Legion.

2.3.2.17 Wren

Wren is a topology-based steering approach for providing network measurement [100]. The targeted networks range from clusters to WANs, and Wren uses topology information about the possible bottlenecks that may occur in the networks. The information is useful in steering the measurement techniques toward the channels where the bottlenecks may occur. Passive and active measurement systems are combined by Wren to minimize the measurement load [100]. Topology-based steering is used to achieve the load measurement task. Moreover, no mechanism for initial scheduling is available, and a fixed application-oriented policy is used for resource allocation. Furthermore, Wren has limited scalability, which results in a low breadth of scope. No QoS attributes are considered in Wren, and the triggering information is also low.

2.3.2.18 Globus

The Globus system achieves a vertically integrated treatment of applications, networks,

and middleware [161]. The low level toolkit performs: (a) communication, (b)

authentication, and (c) access. Metacomputing systems face the problems of configuration and performance optimization.


2.3.3 Cloud Computing Systems

Numerous cloud approaches tackle complex resource provisioning and programming

problems for the users with different priorities and requirements. Eight examples of cloud

computing solutions are summarized and characterized under the key system features in

Table 2.5.

2.3.3.1 Amazon Elastic Compute Cloud

Amazon EC2 [167] is a virtual computing environment that enables a user to run Linux-

based applications. Amazon EC2 provides a rental service of VM on the Internet [79].

Amazon's EC2 service has become a standard-bearer for IaaS providers and provides many different service levels to the users [168]. Depending on the individual user choices, a new "Machine Image" based on: (a) application types, (b) structures, (c) libraries, (d) data, and (e) associated configuration settings can be specified. The user can also choose from the available AMIs in the network and upload an AMI to S3. The machine can be reloaded in a shorter period of time for performing flexible system operations, although the whole system load time increases significantly. Virtualization is achieved by running the machines on Xen [169] at the OS level. The users interact with the system through

Amazon EC2 Command-line tools. Amazon EC2 is built through customizable Linux-

based AMI environment.

2.3.3.2 Eucalyptus

Eucalyptus [80] is a Linux-based open source software framework dedicated for cloud

computing. Eucalyptus allows the users to execute and control the entire VM instances

deployed across a variety of physical resources. Eucalyptus is composed of an NC that

controls the: (a) execution, (b) inspection, and (c) termination of VM instances on the

hosts running CC [110]. CC gathers information about VM and schedules VM execution

on specific NCs. Moreover, the CC manages virtual instance networks. An STC called "Walrus", a storage service, provides a mechanism for storing and accessing VM images and user data [110]. The Cloud Controller is the web service entry point for users and administrators and makes high-level scheduling decisions. Eucalyptus high-level system components are implemented as web services in the system [110]. The Instance Manager is responsible for virtualization in Eucalyptus. Moreover, Amazon EC2's SOAP and Query interfaces provide system access to the user. Dynamic QoS negotiation is performed

in Eucalyptus by Group Managers, who collect information through resource services.


2.3.3.3 Google Application Engine

GAE [170] is a freeware platform designed for the execution of web applications. The

applications are managed through a web-based administration console [171]. GAE is

implemented using Java and Python. GAE provides users with authorization and authentication facilities as a web service, which lifts a burden from the developers. Other than supporting the Python standard library and Java, GAE also supports APIs for: (a) data store, (b) Google accounts, (c) URL fetch, (d) image manipulation, and (e) email services.

2.3.3.4 Global Environment for Network Innovations

GENI provides a shared, collaborative environment for academia, industry, and the public to

catalyze revolutionary discoveries and innovation in the emerging field of global

networks [107]. The project is sponsored by the National Science Foundation and is open

source and broadly inclusive. GENI is a “virtual laboratory” for exploring future internets

at scale [156]. The virtualization is achieved through network accessible APIs. GENI

creates major opportunities to: (a) understand, (b) innovate, (c) transform global

networks, and (d) interactions with society. GENI enables researchers to experiment with

different network structures by running experimental systems within private isolated

slices of a shared test-bed [172]. The user can interact with the GENI interface through

slice federation architecture 2.0 [173]. Dynamic QoS negotiation is also incorporated

through clearing house based resource allocation. GENI can be implemented in: (a) SFA

(PlanetLab), (b) ProtoGENI, and (c) GCF based environment.


Table 2.5: Comparison of Cloud Computing Systems

System | System Focus | Services | Virtualization | Dynamic QoS Negotiation | User Access Interface | Web APIs | Value-added Services | Implementation Structure
Amazon Elastic Compute Cloud (EC2) (2006) | Infrastructure | Compute, Storage (Amazon S3) | OS level, running on a Xen hypervisor | None | EC2 command-line tools | Yes | Yes | Customizable Linux-based AMI
Eucalyptus (2009) | Infrastructure | Compute, Storage | Instance Manager | Group Managers through resource services | EC2's SOAP and Query interfaces | Yes | Yes | Open source, Linux-based
Google App Engine (2008) | Platform | Web application | Application container | None | Web-based administration console | Yes | No | Python
GENI (2007) | Virtual laboratory | Compute | Network-accessible APIs | Clearing-house based resource allocation | Slice Federation Architecture 2.0 | Network-accessible APIs | Yes | SFA (PlanetLab), ProtoGENI, and GCF
Microsoft Live Mesh (2005) | Infrastructure | Storage | OS level | None | Web-based Live Desktop and any device with Live Mesh installed | N/A | No | N/A
Sun Network.com (Sun Grid) (2007) | Infrastructure | Compute | Job management system (Sun Grid Engine) | None | Job submission scripts, Sun Grid web portal | Yes | Yes | Solaris OS, Java, C, C++, FORTRAN
E-learning Ecosystem (2007) | Infrastructure | Web application | Infrastructure layer | None | Web-based dynamic interfaces | Yes | Yes | Programming models available in ASP.Net for the front end and any database like SQL, Oracle at the back end
GRIDS Lab Aneka (2008) | Software platform for enterprise clouds | Compute | Resource manager and scheduler | SLA-based resource reservation on the Aneka side | Workbench, web-based portal | Yes | No | APIs supporting different programming models in C# and .Net
OpenStack (2011) | Software platform | Compute, Storage, Web image | Compute, Web Image Service | None | REST interface | Yes | Yes | N/A


2.3.3.5 Microsoft Live Mesh

Microsoft Live Mesh aims to provide remote access to applications and data that are

stored online. The user can access the uploaded applications and data through web-based

live desktop or the Live Mesh software [74]. The Live Mesh software uses Windows Live login for password protection, and all file transfers are authenticated and protected using SSL [52]. The concept of virtualization is implemented at the OS level. Any

machine having live mesh installed can access Microsoft Live Mesh or web-based live

desktop.

2.3.3.6 Sun Network.Com (Sun Grid)

Sun Grid belongs to the class of clouds that offer their services as PaaS. Sun Grid [174, 175] is used to

execute Java, C, C++, and FORTRAN based applications on the cloud. For running an

application on Sun Grid, the user has to follow a certain sequence of steps. First, the user

has to build and debug the applications and scripts at a local development environment.

The environment configuration must be similar to that on the Sun Grid [175]. Secondly, a

bundled zip archive (containing all the related scripts, libraries, executable binaries, and

input data) must be created and then uploaded to Sun Grid. The virtualization is achieved

through a job management system commonly termed as Sun Grid Engine. Lastly, the Sun

Grid web portal or API can be used to execute and monitor the application. After the

completion of application execution, the results can be downloaded to the local

development environment for viewing [174, 175].

2.3.3.7 E-Learning Ecosystem

E-Learning Ecosystem is a cloud computing based infrastructure used for the

specification of all components needed for the implementation of e-learning solutions

[176, 177, 178]. A fully developed e-learning ecosystem may include: (a) web-based

portal, (b) access learning program, and (c) personal career aspirations. The purpose is to

facilitate the users or employees to: (a) check the benefits, (b) make changes to medical

plans, and (c) learn competencies that tie to the business objectives [177]. The focus of

an e-learning ecosystem is to provide an infrastructure that applies business discipline to

manage the learning assets and activity of the entire enterprise. The virtualization is

implemented at the infrastructure layer [179]. Web based dynamic interfaces are used to

interact with the users. Moreover, some value added services are also provided on


demand to exclusive users.

2.3.3.8 Grids Lab Aneka

GRIDS Lab Aneka [179] is a service-oriented architecture used in enterprise grids. The aim of Aneka is to support the development of dynamic communication protocols, where the preferred protocol selection may change at any time. GRIDS Lab Aneka supports multiple

application models, persistence, and security solutions [179, 180]. Virtualization is an

integral part and is achieved in Aneka through the resource manager and scheduler. The

dynamic QoS negotiation mechanism is specified based on the SLA resource

requirements. Moreover, Aneka addresses deadline (maximum time period that

application needs to be completed in) and budget (maximum cost that the user is willing

to pay for meeting the deadline) constraints. The user access is provided by using a

workbench or a web-based portal along with value added services.

2.3.3.9 OpenStack

OpenStack is large-scale, open source, community-maintained software, built through the collaboration of programmers to produce an open-standard operating system that runs virtual computing or storage clouds, both public and private. OpenStack is composed of three software projects: (a) OpenStack Compute, (b) OpenStack Object Storage, and (c) OpenStack Image Service [181]. OpenStack Compute produces a redundant and scalable cloud computing platform by provisioning and managing large networks of VMs. OpenStack Object Storage is a long-term storage system that stores multiple petabytes of accessible data. OpenStack Image Service is a standard REST interface for querying information about virtual disk images. OpenStack is an open industry standard with a massively scalable public cloud. Moreover, OpenStack avoids proprietary vendor lock-in by supporting all available hypervisors and abiding by Apache 2.0 licensing.

2.4 Classification of Systems

The systems of each category (cluster, grid, and cloud) are classified into software-only and hardware/hybrid-only solutions in the following section, as shown in Table 2.6. The software-only classification is composed of tools, mechanisms, and policies. The hardware and hybrid classification is comprised of infrastructures or


hardware-oriented solutions. Any change in the hardware design and OS extensions is done by the manufacturers. The hardware and OS support can be cost-prohibitive to the end-users. Moreover, programming for newly added hardware and software can require more time and computational cost and can become a big burden to end-users. The cost of changing hardware and software at the user level is the least amongst all the costs associated with the system.

2.4.1. Software Only Solutions

The software only solutions are the projects that are distributed as software products,

components of a software package, or as middleware. The distinguishing feature of software-only solutions is the controlling mechanism or job scheduler. As an example of

such systems, DQS and GNQS cluster queuing systems can be considered. Moreover, the

crucial component is the queue management module. The other examples include

OSCAR and CONDOR grid software packages. For many grid approaches, the

middleware layer is a crucial layer [160]. In fact, for research purposes, the grid system can be reduced to just the software layer. Therefore, most of the grid systems presented in Table 2.6 are categorized as pure software solutions.

2.4.2. Hardware/Hybrid Only Solutions

The hardware/hybrid class of HPC systems usually refers to multi-level cloud systems. Because the clouds are used for business purposes, strict integration of the service software application with the physical devices is needed. Moreover, the intelligent software packages are specially designed and dedicated.


Table 2.6: Classification of Grid, Cloud, and Cluster Systems into Software and Hybrid/Hardware Approaches

Category | Software Only Systems | Hybrid and Hardware Only Systems
Cluster Systems | OpenMosix, Kerrighed, Gluster, Cluster-on-Demand, Enhanced MOSIX, Libra, Faucets, Nimrod/G, Tycoon, DQS, PVM, LoadLeveler, SLURM, PBS, LSF, GNQS | OpenSSI
Grid Systems | G-QoSM, 2K, Bond, Globus, Javelin, Legion, NetSolve, Nimrod/G, Ninja, PUNCH, MOL, AppLeS, Condor, Workflow Based Approach, Grid Harvest Service, Cactus Worm, Network Weather Service | GRACE, Ninf
Cloud Systems | OpenStack, Eucalyptus | Amazon EC2, Sun Grid, Google App Engine, GRIDS Lab Aneka, Microsoft Live Mesh, GENI, E-learning Ecosystem

2.5 Conclusion of the Chapter

This chapter considered the resource allocation mechanisms of distributed high performance computing (cluster, grid, and cloud) systems. To conclude the chapter: firstly, the three categories of distributed high performance computing were analyzed on the basis of the commonalities between them. Secondly, the systems in each category were analyzed and compared based on the features selected for that category. Finally, the analyzed systems in each category were classified into software and hybrid/hardware solutions, i.e., whether a particular system in a category of high performance computing is a software-only, hardware-only, or hybrid solution.


Chapter 3

Power Efficient Resource Allocation Using Least Feasible

Speed


3.1 Introduction

In the previous chapter, the author explored distributed HPC systems from the resource allocation point of view. To do this, firstly, the author extracted some common features from the plethora of literature on distributed HPC systems and compared the distributed HPC systems based upon those features. Secondly, the author extracted common features for each individual distributed HPC category and compared the systems of each category on those features. Finally, the author investigated and classified all the distributed HPC systems into pure software-only solutions or into hardware/hybrid-only solutions. To incorporate power efficiency in distributed HPC systems, the author turns to the other dimension of HPC, that is, the multi-core. While keeping distributed HPC systems in mind, the author selects the multi-core end of HPC for the experimental results of the proposed techniques, because the shorter distance between cores makes cache coherency easier to maintain in multi-core systems than in distributed HPC systems; signal degradation and data transfer time are also lower.

In this chapter, the author proposes a novel generic technique called LFS. The work presented in this chapter identifies the implicit disadvantage associated with the existing counterpart, i.e., the FFS approach. Furthermore, the author investigates properties and bounds that enable the identification of a procedure that can further reduce the speed. The FFS approach calculates the speed at the first scheduling point, while the LFS approach calculates the speed at all scheduling points and selects the scheduling point at which the task is feasible and the speed is minimum. This chapter also presents a simple core load balancing procedure, i.e., the lightest task shift procedure. The approach presented in this chapter can fine-tune the system so that all the cores/processing units operate at the same clock rate and have equally proportionate core utilization. The description of LFS is given below.

LFS: In this approach, firstly, all scheduling points for every task in a task set are calculated. Secondly, the feasibility of every task is checked at all its corresponding scheduling points. Thirdly, speeds for every task are calculated at all its corresponding feasible scheduling points. Finally, the minimum speed amongst all calculated speeds at which the task remains schedulable is selected for that task.


3.2 System Model and Background

In the periodic model of hard real-time systems, a task $\tau_i$ is described by: (i) a task period $P_i$, which is the time between any two consecutive instances of $\tau_i$, (ii) a worst-case execution time $C_i$ that is scalable with core speed, and (iii) a relative deadline $D_i$. A task $\tau_i$ must receive $C_i$ units of CPU share by $D_i$; however, $C_i$ varies considerably at run time. All the aforementioned parameters are integers. The task set $\tau = \{\tau_1, \tau_2, \ldots, \tau_n\}$ consists of $n$ tasks and can be divided into subsets such that $\tau = \{\tau^1, \tau^2, \ldots, \tau^o\}$, where $\tau^1 = \{\tau_1, \tau_2, \ldots, \tau_k\}$, $\tau^2 = \{\tau_{k+1}, \tau_{k+2}, \ldots, \tau_i\}$, and so on. Moreover, a set of cores $\Phi = \{\phi_1, \phi_2, \ldots, \phi_m\}$, with $m \le n$, is available. The system speed $f_i$ lies within a predefined range $[0.1, 1.0]$, with a step size of $0.01$. The unit of speed is arbitrary (hertz, MIPS, or percentage (%)); in our case, we measure speed in terms of percentage (%). The task set size is measured as the number of tasks, say $n$, while the step size is the amount by which a value is incremented: for speed its unit is arbitrary, and for tasks the increment is one. The core utilization of an individual task $\tau_i$ is given as $U_i(\tau_i) = C_i / P_i$, and the cumulative core utilization is denoted by $U(\tau) = \sum_{i=1}^{k} C_i / P_i$. The problem that we are addressing here is to map $\tau$ over $\Phi$ under the fixed-priority scheduling paradigm.
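For instance (an illustrative example of ours, not drawn from the thesis data): a task set with $(C_1, P_1) = (1, 4)$ and $(C_2, P_2) = (2, 8)$ yields $U(\tau) = \frac{1}{4} + \frac{2}{8} = 0.5$, i.e., the two tasks together demand half of one core running at full speed.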

Core utilization and energy consumption are measured in terms of percentage. The first feasibility test for RM scheduling on a uniprocessor (also to be understood as a uni-core) system was reported in [29] and was termed the LL-bound. The LL-bound states that a periodic task system with $D_i = P_i$ is static-priority feasible if

\[ U(\tau) \le n\left(2^{1/n} - 1\right) \tag{1} \]

where $n$ denotes the number of tasks in $\tau$. The term $n(2^{1/n} - 1)$ decreases monotonically from $0.83$ (when $n = 2$) to $\ln(2)$ as $n \to \infty$. This result mandates that any periodic task set of any size is static-priority feasible on a preemptive uniprocessor under RM scheduling whenever $U(\tau)$ is not greater than $0.693$. This result gives a simple $O(n)$ procedure to test task feasibility when tasks arrive at run time. However, the abovementioned is only a sufficient condition. Therefore, it is quite possible for an


implicit-deadline synchronous periodic task system that exceeds the LL-bound to be static-priority feasible. The LL-bound for the RM paradigm is quite pessimistic; it has been shown that for the average case [49]:

\[ U(\tau) \le 0.88 \tag{2} \]

A better utilization-based test, termed the HB, was detailed in [182]. Using the HB test, a periodic task set is deemed schedulable if

\[ \prod_{i=1}^{n} \left( U_i + 1 \right) \le 2 \tag{3} \]

The classic work reported in [29] was later extended by modifying the task parameters in [43]. However, all the aforementioned tests cover only the sufficient condition (SC) and trade utilization for performance.
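As an illustration of how the two tests compare, the following is a minimal Python sketch (the task utilizations are our own example values) that checks a task set against the LL-bound of Eq. (1) and the HB test of Eq. (3):

```python
def ll_bound_test(utils):
    """LL-bound of Eq. (1): sufficient RM test, sum(U_i) <= n(2^(1/n) - 1)."""
    n = len(utils)
    return sum(utils) <= n * (2 ** (1 / n) - 1)

def hb_test(utils):
    """Hyperbolic bound of Eq. (3): sufficient RM test, prod(U_i + 1) <= 2."""
    prod = 1.0
    for u in utils:
        prod *= u + 1
    return prod <= 2

# Three tasks with total utilization 0.80: the LL-bound (~0.7798 for n = 3)
# rejects the set, while the less pessimistic HB test still accepts it.
utils = [0.5, 0.15, 0.15]
print(ll_bound_test(utils))  # False: 0.80 > 0.7798
print(hb_test(utils))        # True: 1.5 * 1.15 * 1.15 ~ 1.98 <= 2
```

Both tests remain only sufficient: a set rejected by both may still pass the exact test derived later in this chapter.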

One possible solution to the aforesaid problem is to first distribute a given workload equally among all the cores and then to test feasibility using the RM bounds, such as the LL-bound [29] or the H-bound [182]. However, these bounds provide only sufficient conditions, and a significant share of the core utilization is sacrificed for schedulability. To the best of our knowledge, this work is the first to: (i) derive the exact RM scheduling conditions for a multi-core system and (ii) determine a uniform lowest possible system speed for a given workload that maintains system feasibility. Symmetric performance among cores is only possible when all the cores operate at the same low speed. The disadvantage associated with higher core frequency is that of leakage power, i.e., a higher clock frequency increases the system power leakage. Therefore, cores must operate at the same minimum possible frequency for the following two reasons: (i) to avoid power leakage and (ii) to conserve energy by allowing cores to execute tasks at a constant speed.

In our proposed model, we assume that a processor has 10 major operational levels, as detailed in Table 3.1. Let $f_i$ denote a speed level and let the corresponding range be given by $\Delta f_i$, as per our processor specifications. If a particular speed $f_i^0$ is unavailable within the range $\Delta f_i$ (minor levels), then the next (higher) nearest value from the range $\Delta f_i$ is assigned to $f_i^0$. We must note that $f_i$ is the highest possible speed within a level. Therefore, any task $\tau_i$ that is schedulable with any speed in $\Delta f_i$ is also schedulable with $f_i$ (for any $i$, $f_i \le f_{\max}$). However, the converse may not hold.
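A minimal Python sketch of this quantization rule, assuming the 0.01-step minor levels and ten major levels of Table 3.1 below (the helper name and integer-hundredths trick are ours):

```python
import math

F_MIN, F_MAX = 0.1, 1.0    # supported speed range, step 0.01 (Table 3.1)

def quantize_speed(f_req):
    """Round a computed speed up to the next available minor level and report
    the enclosing major level f_i; works in integer hundredths to dodge
    floating-point artifacts."""
    hund = max(math.ceil(round(f_req * 100, 6)), int(F_MIN * 100))
    if hund > int(F_MAX * 100):
        raise ValueError("requested speed exceeds the maximum level")
    minor = hund / 100.0                    # next (higher) nearest minor level
    major = math.ceil(hund / 10) / 10.0     # f_i, the top of the enclosing level
    return minor, major

print(quantize_speed(0.684))   # (0.69, 0.7): minor level 0.69 inside major level 6
```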

Initially, we assume $\Gamma$ to be scheduled on a single core. For our model to be as close as possible to real-world scenarios, we opt for a constrained task model $D_i \le P_i$. Let time $t = 0$ be the critical instant; the cumulative workload of a task $\tau_i$ at any instant of time $t$, running at speed $f_i$, is then represented by Eq. (4) below (Table 3.1 lists the operational levels).

Table 3.1: Operational levels and the respective speed ranges.

Level i | f_i | Δf_i (respective subranges/minor levels for speed f_i)
0 | 0.1 | 0.01, 0.02, ..., 0.09, 0.10
1 | 0.2 | 0.11, 0.12, ..., 0.19, 0.20
2 | 0.3 | 0.21, 0.22, ..., 0.29, 0.30
3 | 0.4 | 0.31, 0.32, ..., 0.39, 0.40
4 | 0.5 | 0.41, 0.42, ..., 0.49, 0.50
5 | 0.6 | 0.51, 0.52, ..., 0.59, 0.60
6 | 0.7 | 0.61, 0.62, ..., 0.69, 0.70
7 | 0.8 | 0.71, 0.72, ..., 0.79, 0.80
8 | 0.9 | 0.81, 0.82, ..., 0.89, 0.90
9 | 1.0 | 0.91, 0.92, ..., 0.99, 1.00

$L_i(t) = \dfrac{C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j}{f_i}$    (4)

The classic work reported in [49] details a solution whereby a task $\tau_i$ is always feasible on a generic core $\Theta_i$ at any instant of time $t$ if and only if

$\min_{t \in S_i} L_i(t) \le t$    (5)

where $t$ is a scheduling point and $S_i$ denotes the set of all the scheduling points, constituted by $S_i = \{ lP_j \mid j = 1, \dots, i;\ l = 1, \dots, \lfloor P_i / P_j \rfloor \}$. The whole of the task set $\Gamma$ becomes RM feasible when

$\max_{i = 1, \dots, k} \left( \min_{t \in S_i} \dfrac{L_i(t)}{t} \right) \le 1$    (6)
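As a minimal illustration of Eqs. (4)-(6), the following Python sketch (our own, not the thesis's MATLAB code) enumerates the scheduling points and applies the exact test; the task set is the one of Example 1 below:

```python
import math

def scheduling_points(periods, i):
    """S_i = { l*P_j : j = 1..i, l = 1..floor(P_i/P_j) } for task i (1-indexed)."""
    pts = set()
    for j in range(i):
        for l in range(1, int(periods[i - 1] // periods[j]) + 1):
            pts.add(l * periods[j])
    return sorted(pts)

def workload(C, P, i, t, f=1.0):
    """L_i(t) of Eq. (4): own demand plus higher-priority interference, over speed f."""
    w = C[i - 1] + sum(math.ceil(t / P[j]) * C[j] for j in range(i - 1))
    return w / f

def rm_feasible(C, P, f=1.0):
    """Exact test of Eqs. (5)-(6): every task needs L_i(t) <= t at some t in S_i."""
    return all(
        any(workload(C, P, i, t, f) <= t for t in scheduling_points(P, i))
        for i in range(1, len(C) + 1)
    )

C, P = [1.1, 1.0, 1.0], [3, 5, 10]   # the task set of Example 1 below
print(rm_feasible(C, P, f=1.0))      # True at full speed
print(rm_feasible(C, P, f=0.6))      # False: 0.6 is below the lowest feasible speed
```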


3.3 Lowest Speed Calculations

In this section and in Section 3.4, we address the problem of scheduling hard-deadline periodic tasks in a multi-core environment. Section 3.3 details the simulation results for uniprocessor systems, which are extended to encompass the multi-core counterpart in Section 3.4.

E. Humenay et al. [20] report that the performance of the cores is asymmetric. Therefore, tasks cannot be assigned to the cores with the implicit assumption that all the cores are operating at the maximum clock frequency. Moreover, heat dissipation increases when processors operate at higher clock rates. Because of the abovementioned issues pertaining to higher clock rates, we must first determine the appropriate core performance; once that is known, a uniform system speed can be calculated by distributing the workload among the cores based on some schedulability test. It has been reported in [34] that the bin-packing technique allows only half of the core utilization, and the technique trades utilization at the cost of performance. To overcome the aforesaid gap of 50%, we derive and utilize the necessary and sufficient condition. For our analysis, and for simplicity, we assume that initially the system is a single-core entity. Once the average core speed is determined, we relax the abovementioned assumption to accommodate multiple cores.

A task $\tau_i$ is schedulable on a generic core $\Theta_i$ if and only if Eq. (5) holds true. However, it is possible that the task may also be schedulable at a lower core speed. Therefore, we add the speed component into the schedulability analysis to determine the required task speed, which can be represented by

$f_i \ge \dfrac{C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j}{t}$ for some $t \in S_i$    (7)

Any value of $t \in S_i$ that satisfies Eq. (7) ensures that $\tau_i$ is also schedulable with speed $f_i$. However, for different values of $t$, there could be a set of respective speed levels guaranteeing the schedulability of $\tau_i$.


Ref. [46] reports a methodology that, for a given workload, returns the speed determined at the first feasible point in the scheduling point set, termed FFS. Therefore, as soon as the schedulability is confirmed at the first true scheduling point (the time where $\tau_i$ is schedulable), the value of $f_i$ is also determined. From the aforementioned discussion, an interesting observation can be made, which we state below.

Observation 1. The set of scheduling points $S_i$ for task $\tau_i$ is always in a non-decreasing order, and the first value of $t \in S_i$ that satisfies Eq. (7) does not guarantee the lowest system speed required.

To further elaborate on Observation 1, we highlight the point with the help of the example task set given below.

Example 1. Given three tasks $\tau_1 = (1.1, 3)$, $\tau_2 = (1, 5)$, $\tau_3 = (1, 10)$, where each task $\tau_i$ is represented by its parameters $C_i$ and $P_i$ as an ordered pair $\tau_i = (C_i, P_i)$, determine the lowest core speed to schedule the lowest priority task $\tau_3$, in addition to the higher priority tasks $\tau_1$ and $\tau_2$.

According to the RM scheduling theory, task $\tau_3$ is schedulable if and only if it satisfies Eq. (7). Task $\tau_3$ has a set of scheduling points $S_3 = \{3, 5, 6, 9, 10\}$.

List 1. Task $\tau_3$ is RM-schedulable if and only if at least one of the following holds:

$C_1 + C_2 + C_3 \le 3$
$2C_1 + C_2 + C_3 \le 5$
$2C_1 + 2C_2 + C_3 \le 6$
$3C_1 + 2C_2 + C_3 \le 9$
$4C_1 + 2C_2 + C_3 \le 10$

It can be observed that, in the presence of the workload due to $\tau_1$ and $\tau_2$, task $\tau_3$ is schedulable at the points $5$, $6$, $9$, and $10$. The speeds required at the respective points are $0.84$, $0.86$, $0.70$, and $0.74$. The lowest speed is $0.70$, achieved at the scheduling point $9$, which is the fourth element in set $S_3$. Therefore, the first element does not always guarantee the lowest system speed. From the aforementioned discussion, we can


conclude that all the values of $t \in S_i$ need to be tested to find the lowest core speed for task $\tau_i$. That is, $f_i$ may be obtained by the following equation

$f_i = \min_{t \in S_i} \dfrac{C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j}{t}$    (8)
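Continuing the earlier sketch, the two speed-selection policies differ only in when they stop scanning $S_i$. The following hypothetical helpers (our names, not the thesis's) reproduce the speeds of Example 1, i.e., $0.84$ for FFS at $t = 5$ and $0.70$ for LFS at $t = 9$:

```python
# Assumes scheduling_points() from the previous sketch.
import math

def required_speed(C, P, i, t):
    """Speed needed for tau_i to finish by scheduling point t (Eqs. (7)-(8))."""
    return (C[i - 1] + sum(math.ceil(t / P[j]) * C[j] for j in range(i - 1))) / t

def ffs_speed(C, P, i):
    """First Feasible Speed: stop at the first point whose required speed is <= 1."""
    for t in scheduling_points(P, i):
        f = required_speed(C, P, i, t)
        if f <= 1.0:
            return f
    return None   # task infeasible even at full speed

def lfs_speed(C, P, i):
    """Least Feasible Speed: scan all points, keep the minimum feasible speed (Eq. (8))."""
    feasible = [f for t in scheduling_points(P, i)
                if (f := required_speed(C, P, i, t)) <= 1.0]
    return min(feasible) if feasible else None

C, P = [1.1, 1.0, 1.0], [3, 5, 10]
print(round(ffs_speed(C, P, 3), 2))   # 0.84 at t = 5 (first feasible point)
print(round(lfs_speed(C, P, 3), 2))   # 0.70 at t = 9 (global minimum)
```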

Figure 3.1: Gantt chart for $\tau_1 = (1.30, 3)$, $\tau_2 = (1.19, 5)$ and $\tau_3 = (1.19, 10)$.

Figure 3.2: Gantt chart for $\tau_1 = (1.57, 3)$, $\tau_2 = (1.42, 5)$ and $\tau_3 = (1.42, 10)$.


Figures 3.1 and 3.2 depict the Gantt charts for the task set given in Example 1. The charts are drawn for the task set at the speeds of $0.84$ and $0.70$, respectively. The values for $C_i$ are rounded off to two decimal points to avoid cumbersome Gantt charts. We must note that, irrespective of the representation of decimal fractions, Eq. (8) always results in the exact same analysis and always respects the timing constraints of the task set. In both cases, the entire task set is schedulable with lower speeds. The task set, when executed at the speed of $0.84$, becomes $\tau_1 = (1.30, 3)$, $\tau_2 = (1.19, 5)$, $\tau_3 = (1.19, 10)$, and the same is reflected in Figure 3.1. It can be observed from Figure 3.1 that, after scheduling all the jobs of the tasks, there still are $1.53$ unused time slots ($[4.98, 5]$ and $[7.49, 9]$), and these slots can further be utilized for lowering the system speed. Similarly, Figure 3.2 reflects the Gantt chart for the modified task set when executed at the lower speed of $0.70$, whereby the original task set (given in Example 1) is transformed into $\tau_1 = (1.57, 3)$, $\tau_2 = (1.42, 5)$, $\tau_3 = (1.42, 10)$. In contrast to Figure 3.1, there are only $0.03$ unused slots ($[8.97, 9]$) in Figure 3.2, which is a clear advantage and results in maximum system utilization.

3.4 Experimental Analysis

This section is devoted to experimental analysis; the results can easily be analyzed from the figures used in this section.

3.4.1 Determining the Lowest Speed

In this section, we evaluate the performance of our proposed technique, LFS, by comparing it with the previously mentioned FFS methodology. Both methodologies are compared from the perspective of system speed; the lower the speed, the better the technique.

To compare both techniques, random task sets of sizes within the range of [5, 50] were generated, with a step size of 1. The plots reported in this chapter are the average values of 300 runs of all the task sets of sizes 5 through 50. The task periods were randomly generated from a uniformly distributed range of [100, 10,000]. To obtain the corresponding task execution demands $C_i$ for $\tau_i$, random values were taken from within the range of $[1, P_i]$, also with uniform distribution. The priorities were assigned to the tasks as per the RM scheduling rule; that is, the smaller the task period, the higher the task priority. To have a feasible RM-schedulable task set, initially we keep the system utilization at $0.69$, or $\ln 2$, which is quite low. This low system utilization ensures that all the tasks within a given task set are RM feasible. Otherwise, it is very likely that some of the tasks may not be RM feasible when the system utilization is kept high; moreover, this would also lead to an unfair comparison.
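The generation procedure can be sketched as follows in Python (our illustration, not the thesis's MATLAB code); the rescaling step is one plausible way to hit the target utilization exactly and is our simplification of the uniform draw from $[1, P_i]$ described above:

```python
import random, math

def generate_task_set(n, u_target=math.log(2), p_lo=100, p_hi=10_000):
    """Random task set as in the experiments: periods uniform in [100, 10000],
    executions scaled so total utilization equals u_target, RM priority order."""
    periods = [random.uniform(p_lo, p_hi) for _ in range(n)]
    execs = [random.uniform(1, p) for p in periods]
    u_raw = sum(c / p for c, p in zip(execs, periods))
    execs = [c * u_target / u_raw for c in execs]     # rescale to target utilization
    tasks = sorted(zip(execs, periods), key=lambda cp: cp[1])  # shorter period first
    return [c for c, _ in tasks], [p for _, p in tasks]

C, P = generate_task_set(10)
print(round(sum(c / p for c, p in zip(C, P)), 3))     # ~0.693, i.e., ln 2
```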

The author uses MATLAB as the simulation tool for the results reported in Figure 3.3 of this research work. In Figure 3.3, the author uses the speed required by FFS as the benchmark for comparison, with varying utilization of the computing unit, i.e., the core. Figure 3.3 depicts the advantage of the LFS methodology over the FFS [46] technique. It can be observed that the LFS approach continues until it finds the minimum possible speed for a given task set, while maintaining task set schedulability. In contrast, the FFS procedure stops searching the scheduling points as soon as it finds the first feasible point. The difference between the two techniques is quite large. For instance, for the task set having only 5 tasks, the system speed required by the LFS methodology is much lower than that of the FFS procedure.

To further illustrate the effectiveness of the proposed methodology, we report several simulation results with different system utilizations. Figure 3.3(a) reports that, with an RM-schedulable task set, the LFS procedure is always able to execute all the tasks with a lower system speed; this includes the task set with 50 tasks. As shown in Figure 3.3(d), a higher system speed is required for larger task sets. This is an understandable phenomenon, because when the workload increases, more computational cycles are needed to complete all the tasks by their respective deadlines. From the plots, we can also see that the aforementioned behavior is exhibited by both techniques, as expected.

Although the FFS technique is based on the necessary and sufficient conditions of the RM scheduling theory, the system feasibility is a must, and it is maintained with the FFS approach. However, when the task set size increases, the FFS procedure allows the system speed to grow very rapidly to accommodate the resource requirements and maintain the deadline constraints. On the other hand, our proposed methodology gradually increases the system speed in accordance with the principal objective, which is to conserve energy as much as possible by allowing the system to operate at the slowest possible clock rate while keeping the task deadline constraints intact. Figure 3.3(a) through Figure 3.3(d) reveal the performance of both techniques with the system utilization kept


at 70%, 80%, 90%, and 100%. It can be observed that, when the system utilization increases, the task computational demands also increase, and both techniques need more system speed to accommodate the presented workload; therefore, the system operates at a higher clock rate.

Figure 3.3: Effect of utilization on system speed; task set size (n) vs required speed (%): (a) $U(\Gamma) = 0.69$, (b) $U(\Gamma) = 0.8$, (c) $U(\Gamma) = 0.9$, (d) $U(\Gamma) = 1.0$.

The scheduling schemes studied here focus on distributed processing systems, particularly clusters, and can also be implemented on multi-core systems, as the author is considering. The proposed solution can be extended for scheduling across grids and clouds with variable delays between computing units, with some implications and open challenges, such as: (i) mechanisms for handling heterogeneity, various administrative domains, and user privileges; (ii) a selection mechanism for the centralized server, amongst the local servers, that runs a centralized dispatcher; (iii) maintenance mechanisms for the various queues on the dispatcher, such as the request queue, the server record queue, etc.; and (iv) mechanisms for meeting the QoS requirements of customers and strategies for the fulfillment of SLAs.

3.4.2 Energy Savings

As indicated in the introductory passage, DVS is a promising technique for lowering the power consumption of CMOS circuitry. Before presenting our experimental analysis using the DVS technique, we establish the necessary formulations for the DVS methodology from the previous literature [22, 23, 24, 35].

The average power dissipation $P_{avg}$ of modern processors is composed of four parts:

$P_{avg} = P_{leak} + P_{cap} + P_{std\text{-}by} + P_{short}$    (9)

where $P_{leak}$, $P_{cap}$, $P_{std\text{-}by}$, and $P_{short}$ denote the leakage, capacitive, standby, and short-circuit power, respectively. The most critical component of Eq. (9) is the term $P_{cap}$; therefore, we can ignore the rest of the terms, as in Ref. [24]. Being the dominating term, $P_{cap}$ can be expressed as:

$P_{cap} = \alpha C_L V_{dd}^2 f$    (10)

where $\alpha$ represents a transition-activity-dependent parameter, $C_L$ the switched capacitance, and $V_{dd}$ the supply voltage. Eq. (10) indicates the quadratic dependence of $P_{cap}$ on $V_{dd}$. It can be concluded that lowering the supply voltage is the most effective factor in lowering the dynamic power consumption. However, lowering $V_{dd}$ increases the circuit delay, which may be represented by the following:

$T_{delay} = k \dfrac{V_{dd}}{(V_{dd} - V_{th})^{\gamma}}$    (11)

where $k$ is a constant specific to a given technology that depends on the gate size and capacitance, $V_{th}$ is the threshold voltage (the minimum required voltage), and $\gamma$ is the velocity saturation index of a CMOS circuit, within the range of $1 \le \gamma \le 2$. Because $f$ and $T_{delay}$ are inversely related, we can say that


$f = k' \dfrac{(V_{dd} - V_{th})^{\gamma}}{V_{dd}}$    (12)

Eq. (12) reflects that $f$ is linearly related to the supply voltage; that is, the processor speed is a direct consequence of the supplied voltage [35]. Therefore, by assuming that $P_{avg} \approx P_{cap}$, Eq. (10) can be rewritten as

$P_{avg} = \alpha C_L V_{dd}^2 f$    (13)

It can also be observed that $P_{avg}$ is an increasing function of $f$. Let $E$ be the energy consumed while running a task with an average power $P_{avg}$ at the processor speed of $f$ for $T$ time units. The abovementioned relationship can be represented mathematically by the following equation:

$E = P_{avg}(f) \times T$    (14)
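The voltage-frequency-energy relations of Eqs. (10)-(14) can be sketched numerically as follows; the constants V_TH, ALPHA, and K below are illustrative placeholders of our choosing, not the Crusoe datasheet values:

```python
V_TH, ALPHA, K = 0.36, 1.5, 1.0   # illustrative constants, not datasheet values

def frequency(vdd):
    """Eq. (12): f proportional to (Vdd - Vth)^gamma / Vdd."""
    return K * (vdd - V_TH) ** ALPHA / vdd

def power(vdd, alpha_cl=1.0):
    """Eq. (13): dynamic power P_avg ~ alpha * C_L * Vdd^2 * f."""
    return alpha_cl * vdd ** 2 * frequency(vdd)

def energy(vdd, T):
    """Eq. (14): energy = average power at speed f(Vdd) over T time units."""
    return power(vdd) * T

for vdd in (0.6, 0.8, 1.0):
    print(vdd, round(frequency(vdd), 3), round(energy(vdd, T=10.0), 3))
```

Running the loop shows the intended trend: lowering $V_{dd}$ reduces the attainable frequency roughly linearly but cuts the energy super-linearly, which is exactly why DVS pays off.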

Figure 3.4: Power consumption of the Crusoe processor at the respective voltage levels [19]; voltage (V) vs normalized power (W).

From the aforementioned discussion, we can deduce that an ideal processor would be one that can operate on continuous voltage levels. However, due to the switching overhead, a continuous voltage spectrum is not provided for a CMOS circuit [183]. Therefore, only a discrete number of supply voltage levels are provided, which can be controlled with a DVS technique [184]. In our work, we assume a processor that can support multiple discrete frequency levels within the range of [0.1, 1.0], with a step of 0.01, where 0.1 is the minimum speed needed to keep the peripherals and interrupts powered and active. For our study, we operate within the bounds reported in [166, 19] for a 70 nm Crusoe processor. The bounds and the discrete voltage levels are plotted for the readers' convenience in Figure 3.4, which illustrates the relationship between the power consumption and the supplied voltage per cycle.

To evaluate the two methodologies, namely LFS-energy and FFS-energy, from the point of view of energy savings, we simulate the system under the same arrangement as previously discussed in Section 3.4.1. For this set of simulations, the speed of the system was based on Eq. (7) for the FFS-energy technique and Eq. (8) for the LFS-energy methodology. The workload of the system was a task set within the range of [5, 50], with an increase of a single task after every 1000th iteration. The author uses MATLAB as the simulation tool for the results presented in Figure 3.5 of this research work. In Figure 3.5, the author uses the FFS normalized energy consumption as the benchmark for comparison.

The voltage and the clock rate required for a successful completion of all the tasks within the task set are determined by the FFS-energy and LFS-energy techniques. The total energy consumed within the interval $[T_0, T_1]$ is measured by $\int_{T_0}^{T_1} P_{avg}(t, f_i)\, dt$, where $P_{avg}(t, f_i)$ is the power consumption of the core when executing a task $\tau_i$ at the speed of $f_i$ for $t$ time units. The system utilization is again kept within the range of $[\ln 2, 1.0]$, with a step size of 0.1; that is, the energy values are measured for the task set after every 10% increase in the system utilization. It can be observed from Figure 3.5 that, when all the tasks are schedulable, the savings in energy consumption of both techniques are very encouraging, up to a certain level. The reason behind this low energy consumption is the likelihood of the RM feasibility of all the tasks due to the low system utilization; the computational demands of the individual tasks are much lower than their respective periods.

The only difference is that, with the FFS-energy approach, the plot trend remains higher when the task set size increases. This is due to the fact that some of the task sets may contain tasks that must be run at a higher speed; therefore, the system energy consumption increases, as the power function is quadratically proportional to the system speed.


Figure 3.5: Normalized energy consumption for the task set under varying utilizations; task set size (n) vs normalized energy consumption (%): (a) $U(\Gamma) = 0.69$, (b) $U(\Gamma) = 0.8$, (c) $U(\Gamma) = 0.9$, (d) $U(\Gamma) = 1.0$.

Our proposed LFS-energy technique projects a lower system speed compared to the FFS-energy approach [46]. This is due to its implicit characteristic of continuously searching for the lowest feasible speed out of all the possible speeds, which tends never to be higher than that obtained through Eq. (7). With increased system utilization, the computational demands of the individual tasks increase; therefore, the workload within the allowable time window also increases. It can also be observed from Figure 3.5 that the LFS-energy approach also increases the system speed with the increase in utilization. This is due to the fact that the scheduler must respect all the deadlines of the tasks. However, the FFS-energy approach is a very reactive procedure, as the first feasible scheduling point might demand a high system speed; therefore, the cores run at a higher clock rate. In contrast, the LFS-energy approach determines the point where the minimum task speed is attained; therefore, the speed required for the same task set is much lower. However, the lower system speed identified by Eq. (8) also means that the computation demands of the tasks are now prolonged.

Figure 3.6: LFS and FFS comparison based on required execution time; task set size (n) vs required execution time (ms).

Figure 3.6 compares the FFS approach and the LFS approach based on the required execution time as the task set size increases. It is clear from the results obtained in Figure 3.6 that the existing technique, i.e., FFS [46], outperforms LFS in required execution time. This is because the FFS approach executes tasks at the first feasible schedulable point and calculates the speed at that point, while LFS checks all the feasible points and takes the point at which the speed is the minimum. This extra time taken by the LFS approach is really important: in this research work, the author focuses more on power (energy) than on required execution time or response time. However, care must be taken when applying the LFS approach in the case of hard real-time tasks. The LFS approach can easily be applied to soft real-time tasks and firm real-time tasks, where missing a deadline is not as hazardous as in hard real-time tasks. The effect of lowering the speed on the task execution times is illustrated through the example already presented in this chapter, on which Figures 3.1 and 3.2 are based. Also, the task deadlines are kept intact through the LL-bound and H-bound presented earlier in this chapter. The author uses MATLAB as the simulation tool for the results obtained in Figure 3.6 presented in this research work. In Figure 3.6, the author uses the FFS required execution time as the benchmark for comparison. Based on the above discussion, the time complexity of FFS is $O(m \log n)$, where $m$ denotes the amount of time taken in calculating the number of scheduling points and $\log n$ is the amount of time taken to find the first feasible point and to


calculate the speed at that point. The time complexity of LFS is $O(mn)$, where $m$ denotes the amount of time taken in calculating the number of scheduling points and $n$ is the amount of time taken in calculating the speed at every feasible point; the minimum speed among the calculated speeds is then selected. As we know, $f \propto 1/t$, which is of concern here, i.e., the author is interested in lowering the speed (frequency) and hence the power (energy); the relationship of FFS and LFS may thus be written as $f(LFS) = O(g(FFS))$. It means that, if speed is taken as the comparison parameter, FFS is an upper bound for LFS, i.e., $0 \le f(LFS) \le C \cdot g(FFS)$; if time is the comparison parameter, then all these terms occur in reverse order. If speed is the comparison parameter, the above relationship can also be written as $f(FFS) = \Omega(g(LFS))$. It means that LFS is a lower bound for FFS, i.e., $0 \le C \cdot g(LFS) \le f(FFS)$.

3.5 Task Partitioning in Multi-core Systems

To reduce the number of scheduling points at which the schedulability of a task must be tested, the authors in [185] introduced the concept of a false point.

Definition 1. Under a fixed-priority scheduling, a point $t$ is termed a false point for a generic task $\tau_i$ if and only if it satisfies the inequality constraint $L_i(t) > t$.

The concept of the false point is plausible; however, it is inapplicable to DVS-enabled cores for the following reason.

Theorem 1. Under fixed-priority scheduling and multiple system speed levels, a false point $t$ for $\tau_i$ is not necessarily a false point for the lower priority tasks $\tau_{i+1}, \dots, \tau_n$.

Proof. The proof is presented for $\tau_{i+1}$; it can easily be extended to the case of $\tau_{i+2}, \dots, \tau_n$. As mentioned in Section 3.1, a scheduling point $t$ for $\tau_i$ is also a scheduling point for $\tau_{i+1}$. Let $t'$ be such a point that is present within the set $S_i$ and all the subsequent sets $S_{i+1}, \dots, S_n$. If $t'$ is a false point for $\tau_i$ that is executing at the speed of $f_i$, then

'1

1' '

i

i j

j j

i

i

tC C

PL t t

f

(15)


Similarly, the workload for $\tau_{i+1}$ at the same point $t'$, running at the speed of $f_i'$, can be expressed as

$L_{i+1}(t') = \dfrac{C_{i+1} + \sum_{j=1}^{i} \left\lceil \frac{t'}{P_j} \right\rceil C_j}{f_i'}$    (16)

When $f_i' = f_i$, the false point for $\tau_i$ also remains a false point for $\tau_{i+1}$. However, if $f_i' > f_i$, then assuming that $t'$ remains a false point for $\tau_{i+1}$ would require

$\dfrac{C_{i+1} + \sum_{j=1}^{i} \left\lceil \frac{t'}{P_j} \right\rceil C_j}{f_i'} > t'$    (17)

which is a contradiction because: (a) $f_i' > f_i$ and (b) the workload due to $\tau_{i+1}$ at $t'$ is lowered due to the higher value of $f_i'$. Therefore,

$\exists\, t' : \dfrac{C_{i+1} + \sum_{j=1}^{i} \left\lceil \frac{t'}{P_j} \right\rceil C_j}{f_i'} \le t'$    (18)

which contradicts that $t'$ is a false point for $\tau_{i+1}$.

First, we determine the speed for the execution of each task $\tau_i$. Then we calculate the core-specific minimum required speed, so that all the tasks remain schedulable on the core $\Theta_i$ with speed $f_{\Theta_i}$. This relationship can be captured by the following expression:

$f_{\Theta_i} = \max_{0 < i \le n} \left( \min f_i \right)$    (19)

Eq. (19) ensures that the core operates at the appropriate speed to execute all the tasks $\tau_1, \dots, \tau_k \in \Theta_i$ successfully. Next, we must find the average system speed (a uniform speed for all the cores) to execute the tasks on the cores. As previously mentioned in Section 3.1, the aforementioned result (Eq. (19)) is applicable only to a single-core system. Therefore, we must relax some of the assumptions for the multi-core model. The same technique as discussed in Section 3.1 can also be applied to the whole task set, and the tasks can be mapped onto the cores, which we detail in the subsequent text.


Let the tuple $(\Theta_i, \Gamma_i, f_{\Theta_i})$ represent the task subset $\Gamma_i$ assigned to core $\Theta_i$ running at the speed of $f_{\Theta_i}$. Because it is preferred to execute all the cores at a uniform speed, the average system speed must be calculated. In other words,

$\bar{f} = \dfrac{\sum_{i=1}^{m} f_{\Theta_i}}{m}$    (20)

To achieve load balancing among the cores, we adopt a task-shifting strategy that migrates a task from an un-schedulable core to a core with the smallest workload.

Theorem 2. If a task $\tau_i$ is shifted from an unschedulable core $\Theta_i$ to another core $\Theta_j$ (wherein both cores run at the same speed), then the schedulability of $\Theta_i$ increases by a factor of $C_i$.

Proof. If $\tau_i$ is un-schedulable on $\Theta_i$ at every $t \in S_i = \{ lP_j \mid j = 1, \dots, i;\ l = 1, \dots, \lfloor P_i / P_j \rfloor \}$, then

$C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j > f_i t$

After shifting $\tau_i$ away from $\Theta_i$, the remaining demand satisfies

$\sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j > f_i t - C_i$

The aforementioned is true because $f_i t - C_i < f_i t$; that is, $C_i$ time units of demand are removed from core $\Theta_i$, assuming that both cores run at the same speed $f_i$.

Theorem 3. If all the cores run at the same speed, then adding a task $\tau_i$ to $\Theta_i$ weakens the schedulability of $\Theta_i$ by $C_i$.

Proof. Follows from Theorem 2.

Theorem 4. If all the cores run at the same speed, then no task can be added to a barely schedulable core.

Proof. Let $\tau_i$ be a task such that, in addition to the already schedulable $i - 1$ tasks, it is schedulable on a barely schedulable core $\Theta_i$. The term barely schedulable refers to a core on which only $\epsilon_i$, $0 < \epsilon_i < 1$, slots are available, i.e.,


$f_i t - \sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j = \epsilon_i$

By adding $\tau_i$ to the aforementioned, the demand grows by $C_i$, becoming $C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j$. Because $\epsilon_i < 1 \le C_i$, we get

$C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j > f_i t$    (21)

which shows that the available slot is too small to accommodate the task $\tau_i$. However, this contradicts the assumption that $\tau_i$ is schedulable on the barely schedulable core $\Theta_i$.

There are two possible cases to balance the load among all the cores of the underlying system, which we detail below.

Case 1. $f_{\Theta_i} > \bar{f}$: Because all the computation times are proportionate to the core speed, a lower-speed core would prolong the task computations. Therefore, the task set $\Gamma_i$ that was previously feasible at the speed of $f_{\Theta_i}$ on core $\Theta_i$ now becomes infeasible at the speed of $\bar{f}$. Care must be taken when migrating tasks from $\Theta_i$ to another core $\Theta_j$. We must find the most underutilized core among all the cores operating at the speed $\bar{f}$ that also offers space to accommodate more tasks. Once that core is identified, the task shifting (or migration) process begins. Let core $\Theta_i$ be the task donor and core $\Theta_j$ be the task acceptor, i.e., $f_{\Theta_j} \le \bar{f}$ and $U_{\Theta_j} = \min(U_{\Theta_q})$ for $q = 1, \dots, k$, $q \ne j$. In our system, if a task $\tau_i \in \Gamma_i$ is to be shifted to $\Theta_j$, then the task with the lowest utilization on $\Theta_i$ is chosen as the candidate for shifting. The process continues until the utilization of all the cores is leveled.


This arrangement guarantees that all the task deadlines are met and that all the cores run at the same speed.

Case 2. $f_{\Theta_i} \le \bar{f}$: In this case, the task subset $\Gamma_i$ is feasible on $\Theta_i$, and the core might be underutilized at the speed of $\bar{f}$. As mentioned above, $\Theta_i$ can then accommodate more tasks that are assigned to other cores within the system.

Once the core utilization is balanced among all the cores under RM scheduling, a uniform speed $\bar{f}$ is recalculated. This uniform speed mandates that all the tasks are schedulable and allows the system to operate at the lowest possible speed; therefore, the overall system power consumption is also reduced. In other words, it is the core utilization that decides the system speed, and not the number of tasks. This is due to the fact that there may exist a core $\Theta_i$ that has a higher number of tasks than those assigned to a core $\Theta_j$, while $U_{\Theta_j} > U_{\Theta_i}$; therefore, $\Theta_j$ must operate at a higher speed than $\Theta_i$.
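Under the stated assumptions, the balancing step can be sketched as follows; this reuses scheduling_points() and lfs_speed() from the earlier sketches, uses utilization as the balancing metric, and elides the exact per-move feasibility re-check of Eq. (5) for brevity (every core subset is assumed feasible at full speed):

```python
def core_speed(subset):
    """Eq. (19): a core must run at the maximum of its tasks' least feasible speeds."""
    subset = sorted(subset, key=lambda cp: cp[1])          # RM priority order on the core
    C = [c for c, _ in subset]
    P = [p for _, p in subset]
    return max(lfs_speed(C, P, i) for i in range(1, len(C) + 1))

def utilization(subset):
    return sum(c / p for c, p in subset)

def balance(cores, tol=0.01):
    """Shift the lightest task from the most to the least utilized core until level."""
    while True:
        cores.sort(key=utilization)
        lo, hi = cores[0], cores[-1]
        if utilization(hi) - utilization(lo) <= tol:
            break
        lightest = min(hi, key=lambda cp: cp[0] / cp[1])   # lowest-utilization task
        if utilization(lo) + lightest[0] / lightest[1] >= utilization(hi):
            break                                          # a further move would overshoot
        hi.remove(lightest)
        lo.append(lightest)
    return cores

# Eq. (20): the uniform system speed is the average of the per-core speeds.
# f_bar = sum(core_speed(c) for c in balance(cores)) / len(cores)
```

Each core is represented as a list of (C, P) pairs; moving the lowest-utilization task mirrors the lightest-task migration described above.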

3.6 Task Mapping on Cores

By applying the LFS strategy discussed earlier in this chapter, it is not ensured that the load on all the computing units (cores) will be the same. In order to utilize all the computing units equally, load balancing is required. In this section, we generate the task sets with the same procedure as previously described in Sections 3.4.1 and 3.4.2. Initially, we start with a task set of size 120 to observe its mapping on 8 cores. We plot these results in Figure 3.7, which are categorized into two domains: (i) Figure 3.7(a) and (b) depict the results before applying the proposed strategy and (ii) Figure 3.7(c) and (d) show the results after applying the proposed technique. The author uses MATLAB as the simulation tool for the results obtained in Figures 3.7 and 3.8 presented in this research work. In Figures 3.7 and 3.8, the author investigates the task mapping to the computing units and the utilization of each computing unit before and after applying the lightest-task migration strategy; no benchmark is used.


Figure 3.7: Load distribution on a system with 8 cores: (a) utilization of cores (before balancing); (b) task mapping to cores (before task shifting); (c) utilization of cores (after balancing); (d) task mapping to cores (after balancing).

Figure 3.7(a) reflects the case where a task set of size 120 is distributed over 8 cores and some cores are more heavily utilized than others. For instance, the utilization of core $\Theta_7$ is the highest among all the 8 cores, while $\Theta_8$ has the lowest utilization. Similarly, Figure 3.7(b) shows the corresponding number of tasks, where core $\Theta_2$ has the highest number of tasks, while core $\Theta_8$ has the lowest number of tasks. The load is balanced on the basis of core utilization; therefore, some of the tasks are shifted from core $\Theta_7$ to the other cores. Because we have used the necessary and sufficient conditions in our work, the tasks are assigned to the cores based on the exact feasibility analysis. That is, when a core, say $\Theta_1$, is assigned a certain number of tasks, the rest of the tasks are mapped onto the next core, $\Theta_2$. The same is observed from Figure 3.7, where the cores $\Theta_1$ through $\Theta_7$ are fully utilized while core $\Theta_8$ remains underutilized; this is due to the fact that fewer tasks are left for core $\Theta_8$. It can also be deduced from Figure 3.7(a) and (b) that $\Theta_3$ has a utilization of 80%, while the total number of tasks assigned is only 12 (the 2nd lowest after core $\Theta_8$ in the system, see Figure 3.7(b)). After applying our proposed technique, the results are plotted in Figure 3.7(c) and (d). The core utilization is almost the same; however, the number of tasks assigned to the cores is not uniform (see Figure 3.7(d)).

Figure 3.8: Load distribution on a system with 12 cores: (a) utilization of cores (before balancing); (b) task mapping to cores (before task shifting); (c) utilization of cores (after balancing); (d) task mapping to cores (after balancing).


We further increase the number of cores to 12 and distribute the workload of a task set of size 190. The corresponding results are shown in Figure 3.8. Although we have applied a heavy system load, it can be seen from Figure 3.7(a) and (c), and Figure 3.8(a) and (c), that the utilization of a core never reaches 100%, which is due to the implicit characteristics of the RM scheduling algorithm. Figure 3.8(a) and (b) report the utilization and the task mapping of the cores before load balancing, while Figure 3.8(c) and (d) plot the results after applying the task-shifting technique. From Figure 3.8(a), we can deduce that core $\Theta_9$ and core $\Theta_{10}$ are heavily utilized, while Figure 3.8(b) reports that core $\Theta_{10}$ and core $\Theta_{11}$ have the maximum number of tasks assigned. It is worth mentioning that the task shifting is performed in such a way that the lightest task among all the tasks assigned to the maximally utilized core is shifted to the minimally utilized core. This arrangement: (i) results in the minimum possible workload shifting from the higher to a lower utilized core and (ii) does not violate the timing constraints of the tasks already assigned to the cores. It can be observed from Figure 3.8(b) and (d) that 15 tasks from other cores are shifted to core $\Theta_{12}$ under the load balancing mechanism. Interestingly, it can also be observed from Figure 3.8(d) that core $\Theta_{12}$ has the highest number of tasks (26 in total), while cores $\Theta_7$, $\Theta_9$, and $\Theta_{10}$ have the lowest number of tasks (12 each). However, the utilization of all the cores is almost the same, as reported in Figure 3.8(c). Moreover, initially, core $\Theta_{12}$ was underutilized, as can be observed from Figure 3.8(a), and had the lowest number of task assignments. Because of the shifting of the lightest tasks from other cores, the number of tasks assigned to core $\Theta_{12}$ is now the highest (26, as can be seen from Figure 3.8(d)). However, the utilization is balanced with the remaining cores (see Figure 3.8(c)); therefore, all the cores can now be run at a uniform speed, which was the intention of this work.

As reflected in Figures 3.7(c, d) and 3.8(c, d), any further task shifting is not possible unless task-splitting techniques are applied. Since we do not consider the task-splitting case here, there might be situations where uniform utilization, and hence a uniform speed, will not be possible. In such cases, the speed assigned to the system is the speed of the core that is the most highly utilized.


3.7 Conclusion of the Chapter

Frequency (speed) is one of the factors through which power can be minimized. In this chapter, a new mechanism named Least Feasible Speed (LFS) was proposed for power reduction, and the obtained results were compared with the existing counterpart called First Feasible Speed (FFS) [46]. The obtained results reveal that the speed obtained through LFS is lower compared to FFS, and hence so is the power (energy). The lowest single-core speed is calculated for each core and then the average system speed is calculated. If all the cores run at the average speed, then the tasks that were feasible at the higher speed of a core may now become infeasible at the average speed. Therefore, a lightest-task migration strategy is proposed in this chapter to level the load among the cores and hence run all the cores at the average speed for power reduction.


Chapter 4 Power Efficient Resource Allocation Using Genetic Algorithm


4.1 Introduction

Mathematical formulation and results reported in the previous chapter confirmed that the speed obtained through LFS is lower than the speed obtained through FFS, and hence so is the power (energy), i.e., $power_{LFS} \le power_{FFS}$.

Genetic algorithms are classified as meta-heuristics and are applied when more formal optimizations are intractable and difficult to solve. In this research, a genetic algorithm is applied to the FFS values for improving the speed, and the power as well. The process of the GA is as follows: initially, the offspring's fitness values are calculated and some offspring are selected using a selection method; the selection method can be random, roulette wheel, or tournament. The selected offspring then pass through the crossover and mutation phases. Finally, the fitness values of the new offspring are calculated. After crossover and mutation, only those genes are retained in the new population whose new fitness value is better than the old fitness value, and thus the process of optimization takes place.

It is clear from the results obtained in the previous chapter that the speed, and hence the power, obtained through LFS is the optimal one. In this chapter, the author attempts to further investigate and identify mechanisms for power minimization. To do this, the author applies a genetic algorithm to FFS, which presents very interesting results not only in terms of speed but also in terms of time. This chapter not only presents a novel approach, i.e., GA-FFS, but also compares the author's two proposed algorithms (LFS and GA-FFS) with the existing counterpart, i.e., FFS, in terms of speed and time. The chapter presents the trade-off between time and speed and gives a practical guideline: if time is your main consideration, use the FFS approach for a fast response; if you give more attention to power, use LFS; and use GA-FFS when moderate power and moderate time are required.

4.2 Proposed Work

The working process of GA-FFS is clearly depicted in Figure 4.1. The upper red rectangle in Figure 4.1 represents the process of FFS. Initially, the scheduling points are calculated for a generic task $\tau_i$ through the mechanisms given in the previous chapter. Then the workload is calculated and it is checked whether the load (of this task plus the other higher priority tasks) is feasible or not. The feasibility of task $\tau_i$ is checked through Eq. (5). If the task is feasible at a scheduling point $t$, then the speed of task $\tau_i$ is calculated


through Eq. (7), and the further scheduling points are discarded. This speed is called the FFS. However, if the task is not feasible at the scheduling point, then another scheduling point is taken and the process continues until the first feasible point is reached.

Figure 4.1: Flow chart of GA-FFS



The FFS for every task $\tau_i$ in the task set $\Gamma = \{\tau_1, \tau_2, \dots, \tau_n\}$ is calculated, and these values become the initial population in the GA. The lower red rectangle in Figure 4.1 shows the process of the genetic algorithm. The FFS values behave as the phenotype in the genetic algorithm, and these values must be converted into the genotype for further processing. In other words, the FFS values are in decimal form; therefore, we must convert them into binary form for the further steps of the genetic algorithm.

The fitness values of all the offspring in the initial population are calculated. The fitness function in our case is:

$\min f_i$ such that $f_i \ge \min_{t \in S_i} \dfrac{C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j}{t}$ and $f_i \le f_{i+1}$    (22)

By $\min f_i$ we mean the minimization of the FFS such that the given conditions are satisfied; the first feasible speed $f_i$ can be calculated through Eq. (7).

Figure 4.2: Process of tournament selection. Four offspring are selected at random from the population, with fitness values ($f_i$) of 0.50, 0.55, 0.20, and 0.40; tournaments between offspring (1, 2) and (3, 4) select the offspring with fitness 0.50 and 0.20 as parents.

Once the fitness values are calculated, the next step in the genetic algorithm is the selection of offspring. There are many ways for selection, like random selection, tournament selection, and roulette wheel selection [51]. Every selection method has its merits and


demerits. We use tournament selection, as the notion of the genetic algorithm is "survival of the fittest" [53]. In the selection phase, four offspring are randomly selected from the initial population and then tournament selection is applied to select two offspring for the crossover phase of the genetic algorithm. For a tournament, two offspring are needed; in tournament selection, the offspring having the minimum speed (as our problem is a minimization problem) is selected to become a parent. Figure 4.2 pictorially represents the tournament selection.
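To make the selection step concrete, the following minimal Python sketch (our illustration, not the author's MATLAB code) pairs four randomly drawn offspring and returns the two lower-speed winners; the population strings and fitness values mirror the example of Figure 4.2:

```python
import random

def tournament_select(population, fitness):
    """Pick 4 distinct offspring at random; the lower-fitness (lower speed) one of
    each pair wins, giving the two parents for crossover (ties broken arbitrarily)."""
    a, b, c, d = random.sample(range(len(population)), 4)
    p1 = a if fitness[a] <= fitness[b] else b
    p2 = c if fitness[c] <= fitness[d] else d
    return population[p1], population[p2]

pop = ["0100101001", "0110101010", "0000100001", "1110001000"]
fit = [0.50, 0.55, 0.20, 0.40]    # the values shown in Figure 4.2
# With (1, 2) and (3, 4) paired as in Figure 4.2, the winners have fitness 0.50 and 0.20.
print(tournament_select(pop, fit))
```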

4.2.1 Main Drivers of Genetic Algorithm

The main drivers of a genetic algorithm are crossover and mutation. Crossover always occurs between two offspring, i.e., the crossover process requires two offspring, while the mutation process occurs within a single offspring.

4.2.1.1 Cross Over

Crossover is one of the main drivers of a genetic algorithm. Just like the natural phenomenon, parents meet and produce offspring (children). The offspring have some features of the parents, while they also possess some features of their own. The features that are transferred from parents to offspring are due to the crossover process. Crossover occurs between two offspring. A crossover can occur either at a single point or at multiple points, and the crossover point may be fixed or may vary. In our case, we use a single-point crossover, but the crossover point is not fixed. The crossover process is pictorially represented in Figure 4.3.

Figure 4.3: Crossover process. (a) One-point crossover: parents 0000000000 and 1111111111 with a cut after bit 5 yield offspring 0000011111 and 1111100000. (b) Multi-point crossover: the same parents with cuts after bits 3 and 7 yield offspring 0001111000 and 1110000111.


4.2.1.2 Mutation

Mutation is another driver of the genetic algorithm. Through crossover, traits are transferred from parents to offspring, while mutation is responsible for the offspring's own traits. Mutation means a slight change in an offspring and occurs after crossover. Mutation occurs within a single offspring, so there is no need for offspring pairs. Mutation can be bitwise (multi-point) or single-bit (just one point); sometimes mutation does not occur in an offspring at all. Normally the mutation probability is kept very low: if the mutation probability is high, the search becomes close to random and convergence towards the objective function suffers. Mutation is used to escape from local minima. In our case, we use just a single-point mutation, and the mutation probability is 0.1. For mutation, we randomly select a mutation point and then check against the probability whether mutation occurs at this point or not: if mutation can occur, the bit is changed accordingly; otherwise the offspring is left as it is. Mutation is pictorially represented in Figure 4.4.

Figure 4.4: Mutation process. (a) Random single-point mutation: offspring 0000100001 becomes 0000000001 after the bit at the randomly chosen fifth position is flipped. (b) Bitwise mutation: the probability is checked at each bit.

In random single-point mutation, a point is selected randomly and the probability is checked only once to decide whether the mutation will occur or not, while in bitwise mutation the probability is checked at each bit, either to mutate the bit or to retain it as it is.
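The two drivers can be sketched as follows in Python; the 10-bit strings, the random (non-fixed) cut point, and the 0.1 mutation probability follow the chapter's description, while the helper names are ours:

```python
import random

MUT_PROB = 0.1     # single-point mutation probability used in the thesis

def one_point_crossover(p1, p2):
    """Single-point crossover at a random (non-fixed) cut point."""
    cut = random.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def single_point_mutation(chrom):
    """With probability MUT_PROB, flip one randomly chosen bit; else leave as-is."""
    if random.random() < MUT_PROB:
        q = random.randrange(len(chrom))
        chrom = chrom[:q] + ('0' if chrom[q] == '1' else '1') + chrom[q + 1:]
    return chrom

c1, c2 = one_point_crossover("0000000000", "1111111111")
print(c1, c2)                              # e.g. 0000011111 1111100000 for cut = 5
print(single_point_mutation("0000100001"))
```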

After the crossover and mutation processes, the fitness values of the new offspring are calculated through Eq. (22). The old offspring are replaced with the new offspring if and only if

$New(f_i) < Old(f_i)$    (23)

i.e., the new fitness value of an offspring is less than the old fitness value of the same offspring and also fulfils both constraints of Eq. (22).



The offspring whose new fitness value is not less than its old fitness value is retained in the population for the further iterations of the GA.

Summing up the above discussion: the FFS approach calculates $f_i$ through Eq. (7) and LFS through Eq. (8). GA-FFS improves the $f_i$ obtained through FFS by using the above drivers of the GA, such that the new speed is never less than that obtained through LFS. Hence, we can easily conclude that the obtained speeds satisfy $f_{LFS} \le f_{GA\text{-}FFS} \le f_{FFS}$.

4.2.2 Feasibility Checking Through GA-FFS Approach

By rearranging Eq. (7) and Eq. (8), a task becomes feasible at the FFS if and only if

$\dfrac{C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j}{f_i^{FFS}} \le t$ at the first such point $t \in S_i$    (24)

and a task becomes feasible at the LFS if and only if

$\dfrac{C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j}{f_i^{LFS}} \le t$ for some point $t \in S_i$    (25)

We know that the speeds obtained through LFS, FFS, and GA-FFS are related as $f_{LFS} \le f_{GA\text{-}FFS} \le f_{FFS}$. We also know that as the $f$ value decreases, the required execution time increases. It is also clear from the discussion in the previous chapter that if a task is feasible at the FFS, it is also feasible at the LFS. Therefore, from the above discussion we can conclude that $f_{LFS} \le f_{GA\text{-}FFS} \le f_{FFS}$; hence, if a task is feasible at the LFS, it must be feasible at the GA-FFS as well.

Algorithm 1: GA-FFS

Input: FFS values $f_i$
Output: Set of minimized $f_i$ values

Steps:
1. Calculate the individual fitness $f_i$ of all the offspring in the initial population (the FFS values).
2. FOR epoch := 1 TO total number of epochs
   i.  Tournament selection:
       Select randomly four offspring $Off_i$, $1 \le i \le 4$, and arrange tournaments between (1, 2) and (3, 4).
       Select as parent the offspring whose fitness is lower than its opponent's. // Handle ties arbitrarily
   ii. Crossover:
       Select a cut point $p$, where $1 \le p < size(Off)$.
       Cross over the two parents at $p$.
   iii. Mutation:
       Select a point $p$, where $1 \le p \le size(Off)$.
       IF $mut_{prob}$ lies within the range of $mut_{threshold}$
           mutate the bit at $p$
       ELSE
           retain the bit as it is
       END IF
   iv. Calculate the fitness $f_i$ of the two new offspring.
       IF $New(f_i) < Old(f_i)$ // Handle ties arbitrarily
           Replace the old $Off$ with the new $Off$
       ELSE
           Retain the old $Off$ as it was (before crossover and mutation)
       END IF
   epoch := epoch + 1
END FOR
EXIT
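The following compact Python sketch ties Algorithm 1 together. The 10-bit fixed-point encoding (speed multiplied by 1000) is our assumption for illustration only, since the thesis does not specify the exact genotype width; the replacement rule implements Eq. (23) under the constraints of Eq. (22):

```python
import random

BITS = 10          # genotype width; an assumed fixed-point encoding (speed x 1000)
MUT_PROB = 0.1     # mutation probability used in the thesis

def encode(f):
    return format(int(round(f * 1000)), f"0{BITS}b")       # phenotype -> genotype

def decode(g):
    return int(g, 2) / 1000.0                              # genotype -> phenotype

def ga_ffs(ffs_speeds, lfs_speeds, epochs=150):
    """Improve the FFS speeds with the GA drivers. A new offspring replaces the old
    one only when its decoded speed is smaller (Eq. (23)) while never falling
    below the LFS speed nor exceeding 1.0."""
    pop = [encode(f) for f in ffs_speeds]                  # initial population = FFS values
    for _ in range(epochs):
        a, b, c, d = random.sample(range(len(pop)), 4)     # needs a population of >= 4
        p1 = min(a, b, key=lambda i: decode(pop[i]))       # tournament winners
        p2 = min(c, d, key=lambda i: decode(pop[i]))
        cut = random.randint(1, BITS - 1)                  # one-point crossover, random cut
        kids = [pop[p1][:cut] + pop[p2][cut:], pop[p2][:cut] + pop[p1][cut:]]
        for idx, kid in zip((p1, p2), kids):
            if random.random() < MUT_PROB:                 # single-point mutation
                q = random.randrange(BITS)
                kid = kid[:q] + ('0' if kid[q] == '1' else '1') + kid[q + 1:]
            f_new = decode(kid)
            if lfs_speeds[idx] <= f_new <= 1.0 and f_new < decode(pop[idx]):
                pop[idx] = kid                             # keep only improving offspring
    return [decode(g) for g in pop]
```

With few epochs the returned speeds stay close to the FFS inputs; as the epoch count grows they drift towards the LFS lower bound, matching the behavior reported for Figures 4.5 through 4.7.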

4.3 Experimental Results and Analysis

For the experimental results and analysis, the three techniques FFS [46], GA-FFS, and LFS (proposed in Chapter 3) were compared. To compare the three techniques, random task sets of sizes within the range of [5, 50] were generated, with a step size of 1. The plots reported in this chapter are the average values of 300 runs of all the task sets of sizes 5 through 50. The task periods were randomly generated from a uniformly distributed range of [100, 10,000]. To obtain the corresponding task execution demands $C_i$ for $\tau_i$, random values were taken from within the range of $[1, P_i]$, also with uniform distribution. The priorities were assigned to the tasks as per the RM scheduling rule; that is, the smaller the task period, the higher the task priority. To have a feasible RM-schedulable task set, initially we keep the system utilization at $0.69$, or $\ln 2$, which is quite low.



Figure 4.5: Task set size (n) against required speed (%) for LFS, FFS, and GA-FFS: (a) at 150 epochs, (b) at 400 epochs.

In this section, FFS, LFS, and GA-FFS are evaluated experimentally. Initially, we plot the speed results of FFS, LFS, and GA-FFS in Figure 4.5. It is clear from Figure 4.5 that FFS runs at a higher speed than GA-FFS; therefore, GA-FFS performs better than FFS. The Figure 4.5(a) results are based on 150 epochs and the Figure 4.5(b) results are based on 400 epochs of GA-FFS. Some major findings based on Figure 4.5(a) and Figure 4.5(b) are noted below.

GA-FFS performs better than FFS when speed is taken as the testing attribute. FFS uses Eq. (7) for the speed calculation, and the output of this equation is the input for GA-FFS. As the genetic algorithm is an optimization algorithm, it applies crossover and mutation; in addition, the objective (fitness) function of GA-FFS is $\min f_i$. Therefore, GA-FFS improves the results of FFS; hence, GA-FFS outperforms FFS when speed is taken into consideration. Figure 4.5 also presents the supremacy of LFS over FFS and GA-FFS if speed is the testing criterion; in other words, the LFS algorithm runs a system at a lower speed than FFS and GA-FFS. It should be noted that LFS is efficient when speed is under consideration [2]. Therefore, the author sets a constraint on GA-FFS that the speed obtained through GA-FFS must not be less than that of LFS, as shown in Eq. (22):

$f_i \ge \min_{t \in S_i} \dfrac{C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j}{t}$    (26)


Figure 4.6: Normalized energy consumption (%) against task set size (n) for LFS, FFS, and GA-FFS: (a) at 150 epochs, (b) at 400 epochs.

Figure 4.6 shows the energy consumption results of the FFS, GA-FFS, and LFS algorithms. The Figure 4.6(a) results are based on 150 epochs and the Figure 4.6(b) results are based on 400 epochs of GA-FFS. Some major findings based on Figure 4.6(a) and Figure 4.6(b) are noted below.

GA-FFS consumes less power (energy) than FFS. It is clear from the previous results that the speed obtained through GA-FFS is less than that of FFS. We also know that $E = P_{avg} \times T$ and $P \propto V_{dd}^2 f$. When we put the speed values of FFS and GA-FFS into these two equations, the power (energy) value for GA-FFS outperforms the FFS value. It is also clear from the results that LFS consumes less power (energy) than FFS and GA-FFS, in line with the findings of Figure 4.5(a) and Figure 4.5(b).

Figure 4.7: Required execution time (ms) against task set size (n) for LFS, FFS, and GA-FFS: (a) at 150 epochs, (b) at 400 epochs.


Figure 4.7 depicts the required execution time of FFS, GA-FFS, and LFS. It is clear from Figure 4.7 that GA-FFS beats LFS when the required execution time is the testing parameter. The Figure 4.7(a) results are based on 150 epochs and the Figure 4.7(b) results are based on 400 epochs of GA-FFS. Some major findings based on Figure 4.7(a) and Figure 4.7(b) are noted below.

As is clear from the results obtained in Figure 4.7, GA-FFS beats LFS in required execution time. LFS checks all the scheduling points to find the least feasible speed of every task and hence consumes more execution time, while GA-FFS uses the first feasible scheduling point for every task and then applies the genetic algorithm to improve the results. As we assume that the genetic portion of GA-FFS takes a constant time equal to the number of iterations (epochs/population), GA-FFS beats LFS in required execution time. The results also show that FFS beats GA-FFS and LFS in required execution (response) time: the output of FFS becomes the input for GA-FFS for the further operations of the genetic algorithm, which, as assumed, take a constant time equal to the number of iterations (epochs/population). Therefore, FFS beats GA-FFS and LFS in required execution time.

A task is feasible using GA-FFS, i.e., a task completes within its deadline if GA-FFS is used. As is clear from the algorithm itself, and can be deduced from the previous findings, the required execution time of GA-FFS is less than that of LFS, i.e., $time_{GA\text{-}FFS} \le time_{LFS}$; also, the results obtained in the previous chapter reveal that if a task is feasible at the FFS, it is also feasible at the LFS. Therefore, from the above discussion we can conclude that $time_{FFS} \le time_{GA\text{-}FFS} \le time_{LFS}$. Hence, if a task is feasible at the LFS, it is also feasible at the speed obtained through GA-FFS. The results also show that the genetic algorithm proves the "survival of the fittest" notion; "survival of the fittest" [53] is the notion of Darwin's theory and of the genetic algorithm. As is clear from the algorithm itself, and also from Figures 4.5, 4.6, and 4.7, when the number of epochs is small, GA-FFS behaves like FFS, and when the number of epochs increases, GA-FFS behaves like LFS; i.e., as the number of iterations increases, fitter values are obtained. It is also


clear from the obtained results that the values obtained through GA-FFS are non-decreasing. This follows from the second constraint of the fitness function, $f_i \le f_{i+1}$; a schematic rendering of the two speed constraints is given below.
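The following Python fragment is a schematic rendering of these two constraints, not the author's implementation; the function and argument names are illustrative:

    def satisfies_constraints(f_candidate, f_lfs):
        # (i)  No candidate speed may drop below the corresponding LFS
        #      (optimal) speed, per the constraint set on GA-FFS above.
        # (ii) The selected speeds must be non-decreasing: f_i <= f_{i+1}.
        above_lfs = all(fc >= fl for fc, fl in zip(f_candidate, f_lfs))
        nondecreasing = all(a <= b for a, b in
                            zip(f_candidate, f_candidate[1:]))
        return above_lfs and nondecreasing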

Based on the above discussion, the time complexity of FFS is $O(m \log n)$, where $m$ is the amount of time taken to calculate the number of scheduling points and $\log n$ is the amount of time taken to find the first feasible point and to calculate the speed at that point. The time complexity of LFS is $O(mn)$, where $m$ is the amount of time taken to calculate the number of scheduling points and $n$ is the amount of time taken to calculate the speed at every feasible point and finally to select the minimum among the calculated speeds. The time complexity of GA-FFS is $O(m \log n + k)$, where $m \log n$ is the amount of time taken by FFS and $k$ is the amount of time taken by the genetic-algorithm portion of GA-FFS; $k$ depends on the number of iterations (epochs/population) in the genetic algorithm. If the value of $k$ is comparable to $n$, GA-FFS will take more time than LFS, and if the value of $k$ is small, GA-FFS will take less time than LFS. However, the existing counterpart FFS [46] beats both of our proposed approaches if time is the comparison parameter. Since $f \propto 1/t$, and our concern is lowering the speed (frequency) and hence the power (energy), the relationship between FFS and LFS may be written as $f(LFS) = O(g(FFS))$: if speed is taken as the comparison parameter, FFS is the upper bound for LFS, i.e., $0 \le f(LFS) \le C \cdot g(FFS)$, and if time is the comparison parameter, these terms occur in reverse order. If speed is the comparison parameter, the above relationship can also be written as $f(FFS) = \Omega(g(LFS))$, meaning that LFS is the lower bound for FFS, i.e., $0 \le C \cdot g(LFS) \le f(FFS)$.

4.4 Conclusion of the Chapter

In this chapter, a new mechanism for lowering system speeds, and hence energy, is proposed. The new mechanism is termed GA-FFS; as is clear from its name, the mechanism takes the values obtained through FFS [46] as input and applies a genetic algorithm to further improve the speed of single tasks, and hence of the whole computing unit. The results obtained through GA-FFS are compared with our own proposed mechanism (LFS, discussed in Chapter 3) and with the existing counterpart FFS [46]. The results reveal that GA-FFS improves the results of FFS. Since LFS is the optimal solution, with an increase in the number of epochs (new populations) GA-FFS behaves like LFS, while as the number of epochs decreases, GA-FFS behaves like FFS.


Chapter 5 Resource Allocation Using Load Balancing Mechanisms


5.1 Introduction

Applying the LFS and GA-FFS strategies discussed in Chapter 3 and Chapter 4 does not ensure that the load on all computing units will be the same. "The load imbalance (especially in the many-core processors) is a major source of energy (power) drainage." [62]. In order to utilize all the computing units equally, load balancing strategies are applied. There is a huge amount of literature on load balancing. Load balancing strategies are broadly categorized into two types: static [186] load balancing and dynamic [187, 188] load balancing. Static load balancing has statistical information about the application and uses it for load balancing; D. Grosu et al. [189] formulated the static load balancing problem. Dynamic load balancing mechanisms use only the current state of the system for load balancing. There are three approaches to the dynamic load balancing problem: global, cooperative and non-cooperative. In the global method there is only one dedicated machine for load balancing. In the cooperative approach a few dedicated machines work cooperatively for overall system load balancing. In the non-cooperative method, every system balances its load individually. Kameda et al. [190] developed some algorithms for load balancing in non-cooperative games. Our approach is a hybrid of the cooperative and non-cooperative approaches: it acts like the non-cooperative approach on a single core, and for overall load balancing it uses the cooperative method of the dynamic approach, as we are interested in the overall load balancing of a system.

Andrey G. et al. [191] present a gradient descent algorithm for load balancing. In the gradient descent method of load balancing, a specific load is gradually balanced between systems or cores. The gradient descent method is just like moving down a hill, i.e., load from a highly utilized system slowly transfers to a low-utilized system until the overall load balances. Some variations of the gradient descent algorithm are found in [192, 193, 194]. Our strategies of lightest task migration and task splitting resemble the gradient descent algorithm.

In this chapter of the research work, the author proposes and experimentally evaluates two mechanisms for load balancing among cores or computing units. The first is called the lightest task migration strategy and the second is called the task splitting strategy. In the lightest task migration strategy, the task having minimum utilization on the most highly utilized core is transferred to a low-utilized core, while in the task splitting strategy, a task is split among cores in such a way that the core utilizations become equal. The abovementioned strategies are discussed in the subsequent sections.

5.2 Load Balancing Mechanisms

In this section two mechanisms are discussed for load balancing. The first is task migration or task shifting, and the second is the task splitting strategy. Before applying these load balancing strategies, tasks from a task set are first assigned to different cores such that the tasks are feasible on those cores. The task set $\Gamma = \{\tau_1, \tau_2, \ldots, \tau_n\}$ consists of $n$ tasks and can be divided into subsets $\Gamma_1, \Gamma_2, \ldots$, where $\Gamma_1 = \{\tau_1, \tau_2, \ldots, \tau_k\}$, $\Gamma_2 = \{\tau_{k+1}, \ldots, \tau_i\}$, and so on. Moreover, a set of cores $\Pi = \{\pi_1, \pi_2, \ldots, \pi_m\}$, $m \le n$, is available. The utilization of the tasks assigned to a specific core is given as $U = \sum_{i=1}^{k} C_i / P_i$, where $C_i$ is the execution time needed by a task and $P_i$ is the period of the task.

The problem that we are addressing is to map $\Gamma$ over $\Pi$ such that the tasks are feasible on each core, and then to balance the load among all cores through the task shifting and task splitting strategies. For feasibility checking, first the cumulative workload of task $\tau_i$ is calculated through Eq. (27) at a time instance $t$; after calculating the cumulative workload, feasibility is checked through Eq. (5).

$$L_i(t) = C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t}{P_j} \right\rceil C_j \qquad (27)$$

A task $\tau_i$ is always feasible on a generic core $\pi_i$ at a time instance $t$ if and only if Eq. (5) holds true, where $t$ is a scheduling point and $S_i$ is the set of scheduling points, calculated through:

$$S_i = \left\{ l P_j \;\middle|\; j = 1, \ldots, i; \; l = 1, \ldots, \left\lfloor \frac{P_i}{P_j} \right\rfloor \right\} \qquad (28)$$

On each scheduling point the cumulative workload is calculated; if the workload is feasible, the task is assigned to that core, and if it is not, the task is assigned to another core. A minimal sketch of this assignment test is given below.
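The following Python sketch puts Eq. (27), Eq. (28) and this assignment test together. It is illustrative only: it assumes that Eq. (5) is the standard exact rate monotonic condition, namely that some scheduling point $t \in S_i$ satisfies $L_i(t) \le t$, and that tasks are indexed in RM order (increasing period).

    import math

    def scheduling_points(P, i):
        # Eq. (28): S_i = { l * P_j | j = 1..i, l = 1..floor(P_i / P_j) },
        # with tasks 0-indexed and sorted by increasing period (RM order).
        S = set()
        for j in range(i + 1):
            for l in range(1, int(P[i] // P[j]) + 1):
                S.add(l * P[j])
        return sorted(S)

    def workload(C, P, i, t):
        # Eq. (27): L_i(t) = C_i + sum over j < i of ceil(t / P_j) * C_j.
        return C[i] + sum(math.ceil(t / P[j]) * C[j] for j in range(i))

    def is_feasible(C, P, i):
        # Assumed Eq. (5): task i is feasible on a core iff L_i(t) <= t
        # holds at some scheduling point t in S_i.
        return any(workload(C, P, i, t) <= t for t in scheduling_points(P, i))

A task whose test fails on one core is then tried on the next core, as described above.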

The following assumptions should be kept in mind for the explanation of the task splitting strategy:

i. A task can be split into any number of parts.

ii. The splitting of a task does not affect the overall results.

iii. All tasks are independent.

There are two types of task dependencies: inter-task dependency and intra-task dependency. The primary focus of the author in this chapter is on load balancing mechanisms, not on task dependencies. The authors in [16, 97] also do not consider task dependency. Therefore, for the sake of simplicity, the author of this research work considers neither intra-task dependency (assumption ii) nor inter-task dependency (assumption iii).

5.2.1 Task Migration or Task Shifting

After assigning tasks to cores, the next step is to balance the load among the cores. Task migration or task shifting is one strategy used for load balancing. In this strategy, the core having maximum utilization, i.e., maximum load, is selected, and on that core a task having minimum utilization is selected for shifting. Task utilization is calculated through:

$$U_i = \frac{C_i}{P_i} \qquad (29)$$

A task having low utilization is shifted from a highly utilized core to a low-utilized core, and the process is repeated until the utilization of all cores becomes approximately equal to the average utilization of all cores. Nevertheless, the utilization of the cores in this strategy is not necessarily equal, because the utilizations of the tasks are also different, and this leads to unequal utilization amongst the cores even after shifting the lightest tasks. The lightest task is selected for shifting so as to balance the load among cores gradually. If a task having maximum utilization were selected for shifting, there could be greater fluctuation in balancing the load among cores, i.e., the load on the cores would quickly increase and decrease around the average utilization. A minimal sketch of this strategy is given below.
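The following Python fragment is a minimal sketch of the lightest task migration strategy; the stopping tolerance tol and the exact stopping rule are assumptions for illustration, not taken from the thesis.

    def task_shifting(cores, tol=0.01):
        # Each core is modelled as a list of task utilizations U_i = C_i / P_i.
        avg = sum(sum(c) for c in cores) / len(cores)
        while True:
            hi = max(cores, key=sum)   # most utilized core
            lo = min(cores, key=sum)   # least utilized core
            lightest = min(hi)         # lightest task on the loaded core
            # Stop when the loaded core is near the average, or when moving
            # the lightest task would push the target core past the average.
            if sum(hi) - avg < tol or sum(lo) + lightest > avg + tol:
                break
            hi.remove(lightest)
            lo.append(lightest)
        return cores

    # Example of Section 5.2.3: cores at 0.7 and 0.2 end up at 0.5 and 0.4.
    print(task_shifting([[0.2, 0.5], [0.2]]))   # [[0.5], [0.2, 0.2]]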

5.2.2 Task Splitting

This is another strategy used for load balancing among cores. The task shifting strategy does not guarantee an equal load among all cores, as compared to task splitting. In the task splitting strategy, $C_i$ is the only parameter to play with, because the parameter $P_i$ is constant and we cannot change it. The task splitting strategy balances the load among all cores by splitting the $C_i$ parameter into two parts in such a way that the cores' utilization becomes equal after assigning one part to one core and the second part to another core. The task splitting strategy is more time consuming than the task shifting policy, because in task splitting extra time is required to split a task into two parts and then transfer a part of the split task to another core to balance the load.

The task splitting strategy can be implemented in two ways: i) assign tasks to cores and apply the task splitting strategy directly; ii) first apply the task shifting strategy and then apply task splitting. The second way of implementation is less time consuming than the first, because after applying the task shifting strategy the cores are already approximately balanced. We apply the second way of implementation. The results of task shifting and task splitting are discussed in Section 5.3.

5.2.3 Explanation Through an Example

Let us take three tasks $\tau_1 = (1.2, 6)$, $\tau_2 = (4, 8)$ and $\tau_3 = (1.2, 6)$. Here our focus is load balancing and not the feasibility of tasks; therefore, we assume that tasks $\tau_1$ and $\tau_2$ are feasible on core 1 ($\pi_1$), that $\tau_3$ is feasible on core 2 ($\pi_2$), and that they are assigned to the respective cores. The individual task utilizations, calculated through Eq. (29), are 0.2, 0.5 and 0.2 for $\tau_1$, $\tau_2$ and $\tau_3$ respectively. The overall core utilizations are 0.7 (0.2 + 0.5) for $\pi_1$ and 0.2 for $\pi_2$, and are given at the top of each core in Figure 5.1.

[Two diagrams, labelled Task Shifting and Task Splitting, illustrate the two steps on the two cores, with each core's utilization shown at its top.]

Figure 5.1: Load balancing mechanisms (Task Shifting and Task Splitting) among two cores.

Now, for the task shifting policy, the highly utilized core is $\pi_1$, and the lightest task, having minimum utilization, on $\pi_1$ is $\tau_1$. Therefore, $\tau_1$ is selected and shifted from $\pi_1$ to $\pi_2$. After task shifting, the total utilizations of the cores, depicted in Figure 5.1, are 0.5 and 0.4 for $\pi_1$ and $\pi_2$ respectively. The core utilizations are now close to balanced but not fully balanced. In order to fully balance the core utilizations, the task splitting policy is applied. For the task splitting policy, the only factor we have to play with is $C_i$, which can be split through the following procedure.

The average utilization of a core is equal to the total utilization of all tasks on the cores divided by the number of cores, i.e., $Avg = U_{tot}/n$. So the average utilization of a core is $Avg = (0.5 + 0.4)/2 = 0.45$. The difference between the average core utilization and the actual core utilization is $0.5 - 0.45 = 0.05$ (or $0.45 - 0.40 = 0.05$). Now the utilization $U_i$ of the task to be split is divided by the difference value (0.05) to calculate into how many parts $C_i$ has to be divided for full balancing of the load among the cores. The number of parts into which $C_i$ has to be divided is $Parts_{No} = U_i / Diff_{val}$, so the number of parts is $0.5 / 0.05 = 10$. Next, the $C_i$ of that task is divided by the number of parts in order to determine the portion of $C_i$ that has to be transferred to another core, as $C_i / Parts_{No} = 4/10 = 0.4$. So $C_i$ is split into two portions: one portion is 0.4 and the other portion is 3.6.

Now the utilizations of the split portions are $0.4/8 = 0.05$ (this portion is transferred from one core to the other) and $3.6/8 = 0.45$ (this portion is retained on the same core). In this way the whole load is fully balanced among the cores. The abovementioned policies are pictorially depicted in Figure 5.1. A minimal sketch of this splitting calculation is given below.
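The following Python fragment, a minimal sketch under the chapter's assumption that a task can be split into any number of parts, reproduces this calculation; the function and variable names are illustrative:

    def split_task(C_i, P_i, u_hi, u_lo):
        # Split C_i on the highly utilized core so that both cores reach
        # the average utilization (procedure of Section 5.2.3).
        avg = (u_hi + u_lo) / 2.0      # target utilization per core
        diff = u_hi - avg              # utilization excess to move (0.05)
        parts = (C_i / P_i) / diff     # number of parts C_i is divided into
        C_move = C_i / parts           # portion transferred to the other core
        return C_i - C_move, C_move    # retained and moved portions

    # Worked example: tau_2 = (4, 8) on a core at 0.5, the other at 0.4.
    keep, move = split_task(4.0, 8.0, 0.5, 0.4)   # -> (3.6, 0.4)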

5.3 Results and Discussions

This section shows the results of the strategies discussed in the previous section. Matlab is used for the simulation of these strategies. To compare both techniques, random task sets of sizes within the range [60, 80] were generated. The task periods were randomly generated from a uniformly distributed range of [100, 10000]. To obtain the corresponding task execution demands $C_i$ for $\tau_i$, random values were taken from within the range $[1, P_i]$, also with uniform distribution. Priorities were assigned to the tasks as per RM scheduling rules; that is, the smaller the task period, the higher the task priority. After task set generation, the tasks are assigned to cores and then the discussed strategies are applied; a minimal sketch of the task set generation is given below.
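A minimal Python equivalent of this task set generation (the thesis's Matlab code is not shown, and uniform real-valued draws are an assumption):

    import random

    def generate_task_set(n):
        # Periods uniform in [100, 10000]; demands C_i uniform in [1, P_i];
        # tasks sorted by RM priority (smaller period = higher priority).
        periods = [random.uniform(100, 10_000) for _ in range(n)]
        tasks = [(random.uniform(1, P), P) for P in periods]   # (C_i, P_i)
        return sorted(tasks, key=lambda t: t[1])

    task_set = generate_task_set(random.randint(60, 80))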

[A bar chart plots the Number of Tasks (0 to 20) on each of the four cores before load balancing.]

Figure 5.2: Number of tasks on cores before load balancing.

Figure 5.2 shows the tasks assigned to cores before applying either load balancing strategy. Before load balancing, core 3 has the maximum number of tasks, 19; core 2 and core 4 have 13 tasks each; and the remaining 15 tasks are assigned to core 1, as depicted in Figure 5.2.

[A bar chart plots the Utilization (0 to 1) of each of the four cores before load balancing.]

Figure 5.3: Cores utilization before load balancing.

Figure 5.3 depicts the utilization of the cores corresponding to the tasks assigned in Figure 5.2. It is clear from Figure 5.3 that, before load balancing, core 3 has the maximum utilization, 0.8149, and core 4 has the minimum utilization, 0.6704.

[A bar chart plots the Number of Tasks (0 to 20) on each of the four cores after task shifting.]

Figure 5.4: Number of tasks on cores after task shifting.

Figure 5.4 shows the number of tasks on the four cores after applying the task migration or task shifting strategy. Figure 5.4 depicts that 7 tasks are shifted to core 4 from core 1 and core 3, so core 4 now has the maximum number of tasks, 20. It should be remembered that in task shifting the lightest task is shifted from a highly utilized core to a low-utilized core. Core 2 also gains 2 tasks from core 1 and core 3. After shifting, core 3 has the minimum number of tasks, 12, as depicted in Figure 5.4.

[A bar chart plots the Utilization (0 to 0.8) of each of the four cores after task shifting.]

Figure 5.5: Cores utilization after task shifting.

Figure 5.5 depicts the utilization of the cores corresponding to the number of tasks in Figure 5.4. It is clear from Figure 5.5 that, after applying the task shifting strategy, the core utilizations are not fully equal but approximately equal. In Figure 5.5, core 1 has the minimum utilization, 0.7382, and core 2 has the maximum utilization, 0.7550.

[A bar chart plots the Utilization (0 to 0.8) of each of the four cores after task splitting.]

Figure 5.6: Cores utilization after task splitting.

In order to make the utilization of all cores equal, the task splitting strategy is applied. The result of task splitting is depicted in Figure 5.6, and it is clear from the figure that all cores now have an equal utilization of 0.7453. Although task splitting leads to equal core utilizations, it is more time consuming than the task shifting strategy.

Table 5.1: Overall simulation results

Task set size | Cores for full feasibility | Core | Tasks before shifting/splitting | Load before shifting/splitting | Tasks after shifting | Load after shifting | Load after splitting
--------------|----------------------------|------|---------------------------------|--------------------------------|----------------------|---------------------|---------------------
60 | 4 | C1 | 15 | 0.7834 | 13 | 0.7382 | 0.7450
60 | 4 | C2 | 13 | 0.7125 | 15 | 0.7750 | 0.7450
60 | 4 | C3 | 19 | 0.8149 | 12 | 0.7442 | 0.7450
60 | 4 | C4 | 13 | 0.6704 | 20 | 0.7438 | 0.7450
70 | 5 | C1 | 13 | 0.7516 | 11 | 0.7321 | 0.7330
70 | 5 | C2 | 13 | 0.7429 | 13 | 0.7429 | 0.7330
70 | 5 | C3 | 15 | 0.7999 | 11 | 0.7403 | 0.7330
70 | 5 | C4 | 18 | 0.8404 | 10 | 0.7260 | 0.7330
70 | 5 | C5 | 11 | 0.5326 | 25 | 0.7261 | 0.7330
80 | 5 | C1 | 20 | 0.7437 | 20 | 0.7437 | 0.7360
80 | 5 | C2 | 17 | 0.8179 | 11 | 0.7410 | 0.7360
80 | 5 | C3 | 18 | 0.8503 | 12 | 0.7100 | 0.7360
80 | 5 | C4 | 19 | 0.8169 | 12 | 0.7443 | 0.7360
80 | 5 | C5 | 6 | 0.4519 | 25 | 0.7417 | 0.7360

Table 5.1 depicts the overall simulation results. The simulations were run on an Intel Core 2 Duo machine running Windows 7 as the operating system. All the figures above, from Figure 5.2 to Figure 5.6, are based on the task set of size 60 in Table 5.1.

It is clear from the results given in Table 5.1 that task shifting takes less time than the task splitting mechanism. Another observation is that the simulation time taken by task shifting increases as the task set size increases. In task splitting, an increase in task set size does not always increase the simulation time; this is because, in task splitting, we do not know in advance into how many parts a single task will be split. The overall task splitting time depends on two factors: the number of tasks to be split, and the number of parts into which a single task is split. Therefore, in the task splitting mechanism, an increase in task set size does not always increase the total simulation time.

5.4 Conclusion of the Chapter

The load balancing mechanisms discussed in this chapter are task shifting and task splitting. In the task shifting mechanism, a task having low utilization is selected from a highly utilized core and shifted (transferred) to a low-utilized core. The splitting strategy is applied after task shifting so as to utilize all the computing units/cores equally. In the task splitting strategy, the maximum-utilization task on a highly utilized core is selected and its execution time is split in such a way that, when the split portion is assigned to a low-utilized core, the utilizations of the high- and low-utilized cores become equal. Although splitting a task equates the utilization of the cores, care must be taken when splitting a task among cores, because task splitting takes more time than the task shifting mechanism.


Chapter 6 Conclusion, Recommendations and Future Directions


6.1 Introduction

This research thesis focuses on power efficient resource allocation in HPC systems. The overall scope of the work was presented in Chapter 1, including the problem statement, research issues and the overall contribution of the research. An analysis of existing distributed HPC systems based on predefined features was presented in Chapter 2. To cover HPC systems from the power efficient resource allocation perspective, a new approach called LFS was presented in Chapter 3. In order to further investigate power efficiency in HPC systems, the author presented another approach, called GA-FFS, in Chapter 4 of the research work. In Chapter 5, the author presented load balancing mechanisms in order to balance the load among computing units, which may become unbalanced due to the techniques proposed in Chapter 3 and Chapter 4. Finally, this chapter concludes the research thesis with some recommendations and future directions.

6.2 Conclusion

Chapter 2 of the thesis provides a detailed comparison and description of the three broad categories of HPC systems, namely Cluster, Grid, and Cloud. The said categories have been investigated and analyzed in terms of resource allocation. Moreover, well-known projects and applications from each category are briefly discussed and highlighted. Furthermore, the aforementioned projects in Chapter 2 are compared on the basis of selected common features belonging to the same category. For each category, more specific characteristics are discussed. The features list can be expanded further for Cluster, Grid, and Cloud; however, because the scope of Chapter 2 was resource allocation only, the selected characteristics allow clearer distinctions at each level of the classification. Chapter 2 will help readers to analyze the gap between what is already available in existing systems and what is still required, so that outstanding research issues can be identified. Moreover, the features of cluster, grid, and cloud are closely related to each other, and the said chapter will help in understanding the differences. Furthermore, the systems of each category have been classified as software-only or as hardware/hybrid. Hardware and OS support can be cost prohibitive to end-users, while the programming level is a big burden on end-users. Amongst the three HPC categories, grid and cloud computing appear promising, and a lot of research has been conducted in each category. The focus of future HPC systems is to reduce the operational cost of data centers and to increase resilience to failure, adaptability, and graceful recovery. Chapter 2 of this thesis can help new researchers to address the open areas in the research. Moreover, it also provides basic information along with descriptions of the projects in the broad domains of cluster, grid, and cloud.

In Chapter 3, the author integrated dynamic voltage scaling with the fixed priority scheduling paradigm. A solution was proposed to find the lowest possible core speed for a single task. The proposed technique was then applied to the multi-core system to identify a uniform system speed that conserves energy while maintaining the system timing requirements. The proposed methodology was compared to existing techniques, and the simulation results presented in Chapter 3 revealed superior performance.

The work proposed in Chapter 4 addressed and improved the speed and power of FFS by using a genetic algorithm. This modified version of FFS is termed GA-FFS. The author concluded from the experimental evaluation that GA-FFS is more efficient in speed and power consumption than FFS. GA-FFS also has better results than LFS when the required execution time is considered as the testing parameter.

In Chapter 5, the author presented two strategies for load balancing among cores or systems in an HPC environment. The first strategy is task migration: the lightest task is transferred from a highly utilized core to a low-utilized core, and the process is repeated until the load among the cores is approximately balanced. The other strategy is task splitting: the cores are fully balanced by splitting a task, i.e., the execution time of a single task is divided between a highly utilized core and a low-utilized core in such a way that the core utilizations become fully balanced. Compared to task migration, the task splitting strategy fully balances a specific load among cores, but it is more time consuming, because it takes extra time to split a task in such a way as to balance the load among the cores.

6.3 Recommendations

A recommendation for new researchers is to read Chapter 2 of the thesis; it will help them address the open areas in the research. Further, it is suggested that the FFS technique be used in situations where response time is of greater importance than energy consumption; in other words, the FFS technique responds more quickly than the LFS technique. Another recommendation is to use the LFS technique in situations where power is more important than response time, as established from the results obtained in Chapter 3. Use the GA-FFS mechanism in situations where response time and power consumption are of moderate importance. In the case of load balancing among cores or systems, use the task shifting strategy whenever time is important; otherwise, use the task splitting strategy, which fully balances the load, as is clear from the results obtained in Chapter 5.

Overall, the main objective of this research work has been to devise intelligent resource allocation strategies that improve power (energy) consumption in HPC systems. Indeed, this is a wide-ranging field with several existing prior works and findings. The prime focus of this research is to allocate the computing resources in a high performance computing environment in such a way as to minimize power (energy). This research effort contributes two novel approaches to the HPC systems community from the energy perspective. These efforts also open up some novel directions for future research, which are detailed next.

6.4 Future Directions

For future research, one option is to conduct a survey on research issues in resource allocation mechanisms in the HPC environment. Furthermore, where possible, an existing mechanism can be improved, or a new mechanism for resource allocation can be developed. Initially the new mechanism would target multi-core systems; if promising results were obtained, the new resource allocation mechanism would be extended to distributed HPC systems.

Another future direction is to apply a naturally inspired algorithm to the task assignment problem instead of rate monotonic scheduling and to check the system speed for energy consumption. After that, a speed minimization technique would be applied and the energy consumption checked again. Both energy results (with and without the speed minimization technique) would be compared to verify the energy reduction.

The research work presented in Chapter 5 can also be extended. A good option for future work is to extend the task splitting and task shifting strategies to distributed HPC systems, where many new issues arise, such as the delay time in transferring a task. More interesting results could be obtained by incorporating this delay time into these concepts in a distributed HPC environment. A distributed HPC environment is more challenging than multi-core, and while implementing these concepts in a distributed HPC environment, further interesting research topics and issues may appear. Another future direction is to study the behavior of the task splitting strategy when the intra-task and inter-task dependencies are considered.


REFERENCES:

[1] Hameed Hussain, Saif-Ur-Rahman Malik, Abdul Hameed, Samee Ullah Khan, et

al. "A survey on resource allocation in high performance distributed computing

systems." Parallel Computing 39.11 (2013): 709-736.

[2] Nasro Min-Allah, Hameed Hussain, Samee Ullah Khan, and Albert Y. Zomaya.

"Power efficient rate monotonic scheduling for multi-core systems." Journal of

Parallel and Distributed Computing 72, no. 1 (2012): 48-57.

[3] Hameed Hussain, Muhammad Bilal Qureshi, Muhammad Shoaib and Sadiq Shah,

"Load balancing through task shifting and task splitting strategies in multi-core

environment," IEEE Eighth International Conference on Digital Information Management (ICDIM), 2013, pp. 385-390.

[4] Hameed Hussain, Muhammad Bilal Qureshi and Manzor Illahi Tamimy,

“Minimizing Power Consumption through System Speed using Genetic

Algorithm”, Submitted to The Scientific World Journal (TSWJ), a Hindawi

Journal.

[5] G.L. Valentini, W. Lassonde, S.U. Khan, N. Min-Allah, S.A. Madani, J. Li, L.

Zhang, L. Wang, N. Ghani, J. Kolodziej, H. Li, A.Y. Zomaya, C.-Z. Xu, P. Balaji,

A. Vishnu, F. Pinel, J.E. Pecero, D. Kliazovich, P. Bouvry, “An overview of

energy efficiency techniques in cluster computing systems”, Cluster Computing 16

(1) (2013) 3–15.

[6] F. Pinel, J.E. Pecero, S.U. Khan, P. Bouvry, “Energy-efficient scheduling on

milliclusters with performance constraints”, in: ACM/IEEE International

Conference on Green Computing and Communications (GreenCom) , Chengdu,

Sichuan, China, August 2011, pp. 44–49.

[7] L. Wang, S.U. Khan, D. Chen, J. Kolodziej, R. Ranjan, C.-Z. Xu, A.Y. Zomaya,

“Energy-aware parallel task scheduling in a cluster”, Future Generation Computer

Systems 29 (7) (2013) 1661–1670.

[8] J. Kołodziej, S.U. Khan, L. Wang, M. Kisiel-Dorohinicki, S.A. Madani, E.

Niewiadomska-Szynkiewicz, A.Y. Zomaya, C. Xu, “Security, energy, and

performance-aware resource allocation mechanisms for computational grids”,

Future Generation Computer Systems, October 2012, ISSN 0167-739X,

http://dx.doi.org/10.1016/j.future.2012.09.009.

[9] Grid Computing, http://www.adarshpatil.com/newsite/images/grid-computing.gif,

accessed Feb. 20, 2012

[10] S.U. Khan, “A goal programming approach for the joint optimization of energy

consumption and response time in computational grids”, in: 28th IEEE

International Performance Computing and Communications Conference (IPCCC) ,

Phoenix, AZ, USA, December 2009, pp. 410–417.

[11] D. Chen, L. Wang, X. Wu, J. Chen, S.U. Khan, J. Kolodziej, M. Tian, F. Huang,

W. Liu, “Hybrid modelling and simulation of huge crowd over a hierarchical grid

architecture”, Future Generation Computer Systems 29 (5) (2013) 1309–1317.

[12] J. Kolodziej, S.U. Khan, “Multi-level hierarchical genetic-based scheduling of

independent jobs in dynamic heterogeneous grid environment”, Information

Sciences 214 (2012) 1–19.

[13] Qureshi, Muhammad Bilal, Maryam Mehri Dehnavi, Nasro Min-Allah,

Muhammad Shuaib Qureshi, Hameed Hussain, Ilias Rentifis, Nikos Tziritas et al.


"Survey on Grid Resource Allocation Mechanisms." Journal of Grid Computing

(2014): 1-43.

[14] Y. Amir, B. Awerbuch, A. Barak, R. S. Borgstrom, and A. Keren. “An

Opportunity Cost Approach for Job Assignment in a Scalable Computing Cluster”,

IEEE Transactions on Parallel and Distributed Systems, 11(7):760–768, July

2000.

[15] C. Yeo, and R. Buyya, “A Taxonomy of Market-Based Resource Management

Systems for Utility-Driven Cluster Computing”, Software: Practice and

Experience, Vol. 36, No. 13, Nov. 2006, pp. 1381-1419.

[16] C. Diaz, M. Guzek, J. Pecero, P. Bouvry, and S. Khan, “Scalable and Energy-

efficient Scheduling Techniques for Large-scale Systems”, 11th IEEE

International Conference on Computer and Information Technology (CIT), Sep.

2011, pp. 641-647.

[17] J. Kolodiej, S.U. Khan, E. Gelenbe, E.-G. Talbi, “Scalable optimization in grid,

cloud, and intelligent network computing”, Concurrency and Computation:

Practice and Experience 25 (12) (2013) 1719–1721.

[18] G. Andrews, Foundations of Multithreaded, Parallel, and Distributed

Programming, Addison–Wesley, Boston, MA, USA, 2000.

[19] H. Xin, L. KenLi, L. RenFa, “A energy efficient scheduling base on dynamic

voltage and frequency scaling for multi-core embedded real-time system”, in:

Algorithms and Architectures for Parallel Processing, in: LNCS , vol. 5574, 2009,

pp. 137–145 (Chapter).

[20] E. Humenay, D. Tarjan, K. Skadron, “Impact of process variations on multicore

performance symmetry”, In: Proceedings of the Conference on Design,

Automation and Test in Europe, 2007, pp. 1653-1658

[21] J. Sartori, A. Pant, R. Kumar, P. Gupta, “Variation aware speed binning of

multicore processors”, in: Proceedings of the 11-th IEEE International

Symposium on Quality Electronic Design, 2010, pp. 307–314.

[22] A.P. Chandrakasan, S. Sheng, R.W. Brodersen, “Low power CMOS digital

design”, IEEE J. Solid State Circuits (1992) 472–484.

[23] T.D. Burd, T.A. Pering, A.J. Stratakos, R.W. Brodersen, “A dynamic voltage

scaled microprocessor system”, IEEE J. Solid State Circuits 35 (11) (2000) 1571–

1580.

[24] T. Gloker, H. Meyr, “Design of Energy-Efficient Application-Specific Instruction

Set Processors”, Kluwer Academic Publisher, Dordrecht, 2004.

[25] T. Ishihara, H. Yashura, “Voltage scheduling problem for dynamically variable

voltage processors”, in: International Symposium on Low Power Electronics and

Design, 1998, pp. 197–202.

[26] J.L.W.V. Jensen, "Sur les fonctions convexes et les inégalités entre les valeurs moyennes", Acta Math. 30 (1) (1906) 175-193.

[27] V. Raghunathan, C. Pereira, M. Srivastava, R. Gupta, “Energy aware wireless

systems with adaptive power-fidelity tradeoffs”, IEEE Trans. Very Large Scale

Integr. (VLSI) Syst. 13 (2) (2005).

[28] W. Lee, H. Kim, H. Lee, “Maximum-utility scheduling of operation modes with

probabilistic task execution times under energy constraints”, IEEE Trans. Comput.

Aided Des. Integr. Circuits Syst. 28 (10) (2009) 1531.

[29] C.L. Liu, J.W. Layland, “Scheduling algorithms for multiprogramming in a hard

real-time environment”, Journal of the ACM 20 (1) (1973) 40–61.


[30] N. Min-Allah, S.U. Khan, “A hybrid test for faster feasibility analysis of periodic

tasks”, IJICIC 7 (10) (2011) 5689–5698.

[31] R.I. Davis, T. Rothvo, S.K. Baruah, A. Burns, “Exact quantification of the

suboptimality of uniprocessor fixed priority pre-emptive scheduling”, Real Time

Syst. 43 (3) (2009) 211–258.

[32] C.M. Krishna, Kang G. Shin, “Real-time Systems”, Tsinghua University Press,

McGraw-Hill, 2001.

[33] A. Burns, A.J. Wellings, “Real-Time Systems and Programming Languages”, 4th

ed., Addison Wesley, 2009, 602 pages.

[34] K. Lakshmanan, R. Rajkumar, J.P. Lehoczky, “Partitioned fixed-priority

preemptive scheduling for multi-core processors”, in: Proceedings of the 21st

Euromicro Conference on Real-Time Systems, 2009, pp. 239–148.

[35] S. Saewong, R. Rajkumar, “Practical voltage-scaling for fixed priority rt-

systems”, in: Proceedings of the 9th IEEE Real-Time and Embedded Technology

and Applications Symposium, RTAS03, 2003, pp. 106–115.

[36] N. AbouGhazaleh, B. Childers, D. Mosse, R. Melhem, M. Craven, “Energy

management for real-time embedded applications with compiler support”, in:

ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded

Systems, 2003, pp. 284–293.

[37] H. Aydin, R. Melhem, D. Mosse, P. Alvarez, “Dynamic and aggressive scheduling

techniques for power-aware real-time systems”, in: Proc. IEEE Real-Time Syst.

Symp., 2001, p. 95.

[38] J. Anderson, S. Baruah, “Energy-efficient synthesis of periodic task systems upon

identical multiprocessor platforms”, in: Proc. Distributed Computing Systems,

24th International Conference, 2004, pp. 428–435.

[39] P. Pillai, K.G. Shin, “Real-time dynamic voltage scaling for lowpower embedded

operating systems”, in: Proceedings of the 18th ACM Symposium on Operating

Systems Principles, 2001, pp. 21–24.

[40] F. Li, F.F. Yao, “An efficient algorithm for computing optimal discrete voltage

schedules”, SIAM J. Comput. 35 (2005) 658–671.

[41] F. Zhang, S. Chanson, “Processor voltage scheduling for realtime tasks with non-

preemptible sections”, in: Real-Time System Symposium, Austin, TX, Dec. 2002.

[42] J. Brateman, C. Xian, Y. Lu, “Frequency and speed setting for energy

conservation in autonomous mobile robots”, in: Proceedings of the IFIP

International Federation for Information Processing , vol. 249/2008, 2008, pp.

197–216.

[43] J.W.S. Liu, “Real Time Systems”, Prentice Hall, 2000.

[44] L. George, N. Riverre N, M. Spuri, “Preemptive and Non-Preemptive Real-Time

Uniprocessor Scheduling”, Research Report 2966, INRIA, France, 1996.

[45] N. Min-Allah, SU Khan, Y. Wang, “Optimal task execution times for periodic

tasks using nonlinear constrained optimization”, J. Supercomput. (2010)

doi:10.1007/s11227-010-0506-z.

[46] E. Bini, G.C. Buttazzo, G. lipari, “Minimizing CPU energy in real time systems

with discrete speed management”, ACM Trans. Embedded Comput. Syst. 8 (4)

(2009).

[47] N. Min-Allah, I. Ali, J. Xing, Y. Wang, “Utilization bound for periodic task set

with composite deadline”, J. Comput. Electr. Eng. 36 (6) (2010) 1101–1109.


[48] J.Y.T. Leung, J. Whitehead J., “On the complexity of fixed-priority scheduling of

periodic”, Real-time tasks performance evaluation 2 (1982) 237–250.

[49] J.P. Lehoczky, “Fixed priority scheduling of periodic task sets with arbitrary

deadline”, in: Proceedings of the 11-th IEEE Real-Time System Symposium, 1990,

pp. 201–209.

[50] E. Seo, Y. Koo, J. Lee, “Dynamic repartitioning of real-time schedule on a

multicore processor for energy efficiency”, in: LNCS, vol. 4096/2006, 2006, pp.

69–78.

[51] Sastry, Kumara, David Goldberg, and Graham Kendall. "Genetic algorithms." In

Search methodologies, pp. 97-125. Springer US, 2005.

[52] Hull, David L. "Darwin and his critics: The reception of Darwin's theory of

evolution by the scientific community." (1973).

[53] Paul, Diane B. "The selection of the “Survival of the Fittest”." Journal of the

History of Biology 21.3 (1988): 411-424.

[54] Kokkinos et al., "A framework for providing hard delay guarantees and user fairness in Grid Computing", June 2009.

[55] V. Chauhan et al., "Motivation for Green Computer, Methods Used in Computer Science Program", in National Postgraduate Conference (NPC), 19-20 September 2011, pp. 1-5, doi: 10.1109/NatPC.2011.6136287.

[56] N. Sadashiv, and S. Kumar, “Cluster, Grid and Cloud Computing: A Detailed

Comparison,” 6th International Conference on Computer Science & Education

(ICCSE), Sep. 2011, pp. 477-482.

[57] J.L.W.V. Jensen, "Sur les fonctions convexes et les inégalités entre les valeurs moyennes", Acta Math. 30 (1) (1906) 175-193.

[58] I. Foster, Y. Zhao, I. Raicu, and S. Lu, “Cloud Computing and Grid Computing

360-Degree Compared”, Grid Computing Environments Workshop

2008(GCE’08), Nov. 2008, pp. 1-10.

[59] “What is the difference between Cloud, Cluster and Grid Computing?”,

http://www.cloud-competence-center.de/understanding/difference-cloud-cluster-

grid/, accessed Feb. 5, 2011.

[60] Amazon‟s HPC cloud: supercomputing for the 99%,

http://arstechnica.com/business/2012/05/amazons-hpc-cloud-supercomputing-for-

the-99/, accessed 11 Sep 2012.

[61] F. Dong, and S.G. Akl, "Scheduling Algorithms for Grid Computing: State of Art and Open Problems", Queen's University, Technical report, http://www.cs.queensu.ca/TechReports/Reports/2006-504.pdf.

[62] G. Valentini, S. Khan, and P. Bouvry, “Energy-efficient Resource Utilization in

Cloud Computing,” Large Scale Network-centric Computing Systems, A. Y.

Zomaya and H. Sarbazi-Azad, eds., John Wiley & Sons, Hoboken, NJ, USA. 2013,

ISBN: 978-0-470-93688-7, Chapter 16.

[63] K. Ramamritham, and J. Stankovic, “Scheduling Algorithm and Operating System

Support for Real-time Systems,” Proceedings of IEEE, Vol. 82, No. 1, Aug. 2002,

pp.55-67.

[64] P. Berstein, http://research.microsoft.com/en-us/people/philbe/chapter3.pdf,

accessed July. 25, 2011.

[65] P. Wieder, O. Waldrich, W. Ziegler, “Advanced Techniques for Scheduling,

Reservation and Access Management for Remote Laboratories and Instruments,”


2nd IEEE International Conference on e-Science and Grid Computing (e-

Science’06), Dec. 2006, pp. 128-128.

[66] “Continuous Availability for Enterprise Messaging: Reducing Operational Risk

and Administration Complexity”,

http://www.progress.com/docs/whitepapers/public/sonic/sonic_caa.pdf, accessed

Feb. 12, 2011.

[67] “Cluster Computing”, http://searchdatacenter.techtarget.com/definition/cluster-

computing, accessed Feb. 03, 2011.

[68] “Parallel Sysplex”, http://www-03.ibm.com/systems/z/advantages/pso/, accessed

Feb.22, 2011.

[69] L. Kalé and S. Krishnan, "Charm++: Parallel Programming with Message-driven Objects," Parallel Programming Using C++, G. V. Wilson and P. Lu, eds., MIT Press, Cambridge, MA, USA, 1996, pp. 175-213.

[70] P. Brucker, “Scheduling Algorithms”, 4th edition, Springer-Verlag, Guildford,

Surrey, UK, 2004.

[71] L. Wang, J. Tao, H. Marten, A. Streit, S.U. Khan, J. Kolodziej, D. Chen, “Map

Reduce across distributed clusters for data-intensive applications”, in: 26th IEEE

International Parallel and Distributed Processing Symposium (IPDPS) , Shanghai,

China, May 2012, pp. 2004–2011.

[72] G. White, and M. Quartly,

http://www.ibm.com/developerworks/systems/library/es-linuxclusterintro,

accessed Feb. 15, 2012.

[73] P. Lindberg, J. Leingang, D. Lysaker, K. Bilal, S. Khan, P. Bouvry, N. Ghani, N.

Min-Allah, and J. Li, “Comparison and Analysis of Greedy Energy-Efficient

Scheduling Algorithms for Computational Grids,” Energy Aware Distributed

Computing Systems, A. Y. Zomaya and Y.-C. Lee, eds., John Wiley & Sons,

Hoboken, NJ, USA.

[74] Microsoft Live Mesh, http://www.mesh.com, accessed Feb. 12, 2011.

[75] I. Foster, C. Kesselman, and S. Tuecke “The Anatomy of the Grid,” International

Journal of Supercomputer Applications, Vol. 15, No. 3, Aug. 2001, pp. 200-222.

[76] D. Irwin, L. Grit, and J. Chas, “Balancing Risk and Reward in a Market-based

Task Service,” 13th International Symposium on High Performance Distributed

Computing (HPDC13), June 2004 pp. 160-169.

[77] C. Yeo and R. Buyya, “Service Level Agreement based Allocation of Cluster

Resources: Handling Penalty to Enhance Utility,” 7th IEEE International

Conference on Cluster Computing (Cluster 2005), Sep. 2005.

[78] R. Buyya, R. Ranjan and R. N. Calheiros, “Modeling and Simulation of Scalable

Cloud Computing Environments and the CloudSim Toolkit: Challenges and

Opportunities”. Pro-ceedings of the 7th High Performance Computing and

Simulation Conference (HPCS 2009, IEEE Press, New York, USA), Leipzig,

Germany, June 21-24, 2009.

[79] S. Toyoshima, S. Yamaguchi, and M. Oguchi, “Storage Access Optimization

withVirtual Machine Migration and Basic Performance Analysis of Amazon

EC2,” IEEE 24th International Conference on Advanced Information Networking

and Applications Workshops, Apr. 2010, pp. 905-910.

[80] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, "The Eucalyptus Open-Source Cloud Computing System," 9th IEEE/ACM International Symposium on Cluster Computing and the Grid

(CCGRID ’09), May 2009, pp. 124-31.

[81] K. K. Droegemeier, D. Gannon, D. Reed, B. Plale, J. Alameda, T. Baltzer, K.

Brewster, R. Clark, B. Domenico, S. Graves, E. Joseph, D. Murray, R.

Ramachandran, M. Ramamurthy, L. Ramakrishnan, J. A. Rushing, D. Weber, R.

Wilhelmson, A. Wilson, M. Sue, and S. Yalda,“Service-Oriented Environments

for Dynamically Interacting with Mesoscale Weather”, Computing in Science and

Engg., Vol. 7, No. 6, 2005, pp.12–29.

[82] J. Sherwani, N. Ali, N. Lotia, Z. Hayat, and R. Buyya, “Libra: A Computational

Economy-based Job Scheduling System for Clusters,” Software: Practice and

Experience, Vol. 34, No. 6, May 2004, pp. 573-590.

[83] T. Casavant, and J. Kuhl, “A Taxonomy of Scheduling in General-purpose

Distributed Computing Systems,” IEEE Transactions on Software Engineering,

Vol. 14, No.2, Jan. 1988, pp. 141-154.

[84] J. Regeh, J. Stankovic, and M. Humphrey, “The Case for Hierarchical Schedulers

with Performance Guarantees”, TR-CS 2000-07, Department of Computer

Science, University of Virginia, Mar. 2000, 9 pp.

[85] R. Wolski, N. Spring, and J. Hayes, “Predicting the CPU Availability of Time-

shared Unix Systems on the Computational Grid,” Proceedings of the 8th High-

Performance Distributed Computing Conference, Aug.1999.

[86] N. Arora, R. Blumofe, C. Plaxton, “Thread Scheduling for Multi-programmed

Multi-processors,” Theory of Computing Systems, Vol. 34, No. 2, 2001, pp. 115-

144.

[87] T. Xie, A. Sung, X. Qin, M. Lin, and L. Yang, “Real-time Scheduling with Quality

of Security Constraints,” International Journal of High Performance Computing

and Networking, Vol. 4, No. 3, 2006, pp. 188-197.

[88] L. Wang, J. Tao, H. Marten, A. Streit, S. Khan, J. Kolodziej, and D. Chen, “Map

Reduce across Distributed Clusters for Data-intensive Applications,” 26th IEEE

International Parallel and Distributed Processing Symposium (IPDPS) , May

2012.

[89] S. Iqbal, R. Gupta and Y. Lang, “Job Scheduling in HPC Clusters”, Power

Solutions, Feb. 2005, pp. 133-135.

[90] M. Jette, A. Yoo, and M. Grondona “SLURM: Simple Linux Utility for Resource

Management”, D. G. Feitelson and L. Rudolph eds., Job Scheduling Strategies for

Parallel Processing, 2003, pp. 37-51.

[91] S. Senapathi, D. K. Panda, D. Stredney, and H.-W. Shen,“A QoS Framework for

Clusters to support Applications with Resource Adaptivity and Predictable

Performance,” Proceedings of the IEEE International Workshop on Quality of

Service (IWQoS), May.

[92] K. H. Yum, E. J. Kim, and C. Das, “QoS provisioning in clusters: an

investigationof router and NIC design”, In ISCA-28, 2001.

[93] J. Leung, “Handbook of Scheduling: Algorithms, Models, and Performance

Analysis, First Edition”, CRC Press, Inc., Boca Raton, FL, USA, 2004.

[94] S. Ali, T.D. Braun, H.J. Siegel, A.A. Maciejewski, N. Beck, L. Boloni, M.

Maheswaran, A.I. Reuther, J.P. Robertson, M.D. Theys, B. Yao, “Characterizing

resource allocation heuristics for heterogeneous computing systems,” in: A.R.

Hurson (Ed.), Advances in Computers, vol. 63: Parallel, Distributed, and

Pervasive Computing, Elsevier, Amsterdam,The Netherlands, 2005, pp. 91–128.


[95] P. Dutot, L. Eyraud, G. Mounie, and D. Trystram, “Bi-criteria Algorithm for

Scheduling Jobs on Cluster Platforms,” 16th ACM Symposium on Parallelism in

Algorithms and Architectures (SPAA), July 2004, pp. 125-132.

[96] F. Pinel, J. Pecero, P. Bouvry, and S. Khan, “A Two-Phase Heuristic for the

Scheduling of Independent Tasks on Computational Grids,” ACM/IEEE/IFIP

International Conference on High Performance Computing and Simulation

(HPCS), July 2011, pp. 471-477.

[97] J. Kolodziej, S. Khan, and F. Xhafa, “Genetic Algorithms for Energy-aware

Scheduling in Computational Grids,” 6th IEEE International Conference on P2P,

Parallel, Grid, Cloud, and Internet Computing (3PGCIC), Oct. 2001, pp. 17-24.

[98] K. Rzadca, “Scheduling in multi-organization grids: Measuring the inefficiency of

decentralization,”7th International Conference on Parallel Processing and

Applied Mathematics, Gdansk,Poland, 2007, pp.1048-1058.

[99] E. Huedo, R. Montero, and I. Llorente, “A Framework for Adaptive Execution in

Grids,” Software-Practice and Experience, Vol. 34, No. 07, June 2004, pp.631-

651.

[100] S. Chapin, J. Karpovich, and A. Grimshaw, “The Legion Resource Management

System,” 5th Workshop on Job Scheduling Strategies for Parallel Processing ,

Apr.1999, pp.162-178.

[101] N. Kapadia, and J. Fortes, “PUNCH: An Architecture for Web-enabled Wide-area

Network-computing,” The Journal of Networks, Software Tools and Applications,

Special Issue on High Performance Distributed Computing , Vol. 2, No. 2,

Sep.1999, pp.153-164.

[102] B. Lowekamp, “Combining Active and Passive Network Measurements to Build

Scalable Monitoring Systems on the Grid,” ACM SIGMETRICS Performance

Evaluation Review, Vol. 30, No. 4, 2003, pp.19-26.

[103] M. Litzkow, M. Livny, and M. Mutka, “Condor- A Hunter of Idle

Workstations,”8th International Conference of Distributed Computing Systems ,

June 1988, pp. 104 - 111.

[104] L. Wang, W. Jie, and J. Chen, “Grid Computing: Infrastructure, Service, and

Applications, Kindle Edition”, CRC Press 2009, pp. 338.

[105] S. Khan and I. Ahmad, “A Cooperative Game Theoretical Technique for Joint

Optimization of Energy Consumption and Response Time in Computational

Grids,” IEEE Transactions on Parallel and Distributed Systems, Vol. 20, No. 3,

2009, pp. 346-360.

[106] L. Wang and S. Khan, “Review of Performance Metrics for Green Data Centers: A

Taxonomy Study,” Journal of Supercomputing.

[107] GENI,http://www.geni.net, acessed Feb. 02, 2011

[108] Google Nimbus, http://www.nimbusproject.org/doc/nimbus/faq/, accessed Apr.

05, 2012.

[109] Open Nebula, http://opennebula.org/, accessed Apr. 05, 2012.

[110] F. Lombardi, and R. Di Pietro, "Secure Virtualization for Cloud Computing," Journal of Network and Computer Applications, Vol. 34, No. 4, July 2011, pp. 1113-1122.

[111] D. Benslimane, D. Schahram, and S. Amit, “Services Mashups: The New

Generation of Web Applications,” IEEE Internet Computing, Vol. 12, No. 5, Feb.

2008, pp. 13-15.


[112] L. Skorin-Kapov, M. Matijasevic, “Dynamic QoS Negotiation and Adaptation for

Networked Virtual Reality Services,” IEEE WoWMoM ’05, Taormina, Italy,

June2005, pp. 344–51.

[113] Vmware, Inc.,

http://www.vmware.com/pdf/vsphere4/r41/vsp_41_resource_mgmt.pdf, accessed

July 23, 2011.

[114] HP Cloud, https://www.hpcloud.com/pricing, accessed August 08, 2013.

[115] A. Barak and O. La’adan, “The MOSIX Multicomputer Operating System for High Performance Cluster Computing,” Future Generation Computer Systems, Vol. 13, No. 4-5, Mar. 1998, pp. 361-372.

[116] Gluster, www.gluster.org, accessed Feb. 16, 2012.

[117] L. Kalé, S. Kumar, M. Potnuru, J. DeSouza, and S. Bandhakavi, “Faucets: Efficient Resource Allocation on the Computational Grid,” 33rd International Conference on Parallel Processing (ICPP 2004), Aug. 2004.

[118] M. Bhandarkar, L. Kalé, E. Sturler, and J. Hoeflinger, “Adaptive Load Balancing for MPI Programs,” Lecture Notes in Computer Science (LNCS), Vol. 2074, May 2001, pp. 108-117.

[119] L. Kalé, S. Kumar, and J. DeSouza, “A Malleable-job System for Timeshared Parallel Machines,” 2nd International Symposium on Cluster Computing and the Grid (CCGrid 2002), May 2002, pp. 215-222.

[120] DQS, http://www.msi.umn.edu/sdvl/info/dqs/dqs-intro.html, accessed Mar. 20, 2011.

[121] K. Lai, L. Rasmusson, E. Adar, L. Zhang, and B. Huberman, “Tycoon: An Implementation of a Distributed, Market-based Resource Allocation System,” Multiagent and Grid Systems, Vol. 1, No. 3, Aug. 2005, pp. 169-182.

[122] K. Lai, B. A. Huberman, and L. Fine, “Tycoon: A Distributed Market-based Resource Allocation System,” TR-arXiv:cs.DC/0404013, HP Labs, Palo Alto, CA, USA, Apr. 2004, 8 pp.

[123] J. Chase, D. Irwin, L. Grit, J. Moore, and S. Sprenkle, “Dynamic Virtual Clusters in a Grid Site Manager,” 12th International Symposium on High Performance Distributed Computing (HPDC12), June 2003, pp. 90-100.

[124] C. Morin, R. Lottiaux, G. Vallee, P. Gallard, G. Utard, R. Badrinath, and L. Rilling, “Kerrighed: A Single System Image Cluster Operating System for High Performance Computing,” Proceedings of Europar 2003 Parallel Processing, Lecture Notes in Computer Science, Vol. 2790, Aug. 2003, pp. 1291-1294.

[125] Kerrighed, http://kerrighed.org/wiki/index.php/Main_Page, accessed Mar. 15, 2011.

[126] OpenSSI, http://openssi.org/cgi-bin/view?page=openssi.html, accessed Mar. 12, 2011.

[127] C. Yeo and R. Buyya, “Pricing for Utility-driven Resource Management and Allocation in Clusters,” International Journal of High Performance Computing Applications, Vol. 21, No. 4, Nov. 2007, pp. 405-418.

[128] P. Springer, “PVM Support for Clusters,” 3rd IEEE International Conference on Cluster Computing (CLUSTER’01), Oct. 2001.

[129] B. Chun and D. Culler, “Market-based Proportional Resource Sharing for Clusters,” TR-CSD-1092, Computer Science Division, University of California, Berkeley, USA, Jan. 2000, 19 pp.

[130] GNQS, http://gnqs.sourceforge.net/docs/starter_pack/introducing/index.html, accessed Jan. 20, 2011.

[131] Workload Management with LoadLeveler, http://www.redbooks.ibm.com/abstracts/sg246038.html, accessed Feb. 10, 2011.

[132] http://www.platform.com/workload-management/high-performance-computing, accessed Aug. 07, 2011.

[133] Research Computing and Cyber Infrastructure, http://rcc.its.psu.edu/user_guides/system_utilities/pbs/, accessed Aug. 07, 2011.

[134] J. Basney and M. Livny, “Deploying a High Throughput Computing Cluster,” High Performance Cluster Computing, Vol. 1, R. Buyya, ed., Prentice Hall, 1999, pp. 116-134.

[135] R. Buyya, D. Abramson, and J. Giddy, “A Case for Economy Grid Architecture for Service Oriented Grid Computing,” 15th International Parallel and Distributed Processing Symposium, Apr. 2001, pp. 776-790.

[136] D. Batista and N. Fonseca, “A Brief Survey on Resource Allocation in Service Oriented Grids,” IEEE Globecom Workshops, Nov. 2007, pp. 1-5.

[137] H. Nakada, M. Sato, and S. Sekiguchi, “Design and Implementation of Ninf: Towards a Global Computing Infrastructure,” Future Generation Computer Systems (Meta-computing Special Issue), Vol. 15, No. 5-6, Oct. 1999, pp. 649-658.

[138] K. Krauter, R. Buyya, and M. Maheswaran, “A Taxonomy and Survey of Grid Resource Management Systems for Distributed Computing,” Journal of Software Practice and Experience, 2002, pp. 135-164, DOI: 10.1002/spe.432.

[139] R. Al-Ali, A. Hafid, O. Rana, and D. Walker, “QoS Adaptation in Service-Oriented Grids,” Performance Evaluation, Vol. 64, No. 7-8, Aug. 2007, pp. 646-663.

[140] R. Al-Ali, O. Rana, D. Walker, S. Jha, and S. Sohail, “G-QoSM: Grid Service Discovery using QoS Properties,” Computing and Informatics Journal, Special Issue on Grid Computing, Vol. 21, No. 4, Aug. 2002, pp. 363-382.

[141] I. Foster, C. Kesselman, J. Nick, and S. Tuecke, “The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration,” Argonne National Laboratory, Mathematics and Computer Science Division, Chicago, Jan. 2002, 37 pp., www.globus.org/research/papers/ogsa.pdf.

[142] M. Neary, A. Phipps, S. Richman, and P. Cappello, “Javelin 2.0: Java-based Parallel Computing on the Internet,” European Parallel Computing Conference (Euro-Par 2000), Aug. 2000, pp. 1231-1238.

[143] R. Wolski, N. Spring, and J. Hayes, “The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing,” Future Generation Computer Systems, Vol. 15, No. 5, 1999, pp. 757-768.

[144] D. Andresen and T. McCune, “Towards a hierarchical scheduling system for distributed WWW server clusters,” Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC).

[145] F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao, “Application level scheduling on distributed heterogeneous networks,” Proceedings of Supercomputing 1996.

[146] N. Spring and R. Wolski, “Application level scheduling: Gene sequence library comparison,” Proceedings of ACM International Conference on Supercomputing, July 1998.

[147] X. Sun and M. Wu, “GHS: A Performance System of Grid Computing,” 19th IEEE International Parallel and Distributed Processing Symposium, Apr. 2005.

[148] B. Cooper and H. Garcia-Molina, “Bidding for Storage Space in a Peer-to-Peer Data Preservation System,” 22nd International Conference on Distributed Computing Systems (ICDCS 2002), July 2002, pp. 372-381.

[149] D. Carvalho, F. Kon, F. Ballesteros, M. Román, R. Campbell, and D. Mickunas, “Management of Execution Environments in 2K,” 7th International Conference on Parallel and Distributed Systems (ICPADS ’00), July 2000, pp. 479-485.

[150] F. Kon, R. Campbell, M. Mickunas, and K. Nahrstedt, “2K: A Distributed Operating System for Dynamic Heterogeneous Environments,” 9th IEEE International Symposium on High Performance Distributed Computing (HPDC ’00), Aug. 2000, pp. 201-210.

[151] M. Román, F. Kon, and R. H. Campbell, “Design and Implementation of Runtime Reflection in Communication Middleware: The dynamicTAO Case,” Workshop on Middleware (ICDCS’99), May 1999.

[152] D. Schmidt, Distributed Object Computing with CORBA Middleware, http://www.cs.wustl.edu/~schmidt/corba.html, accessed Feb. 4, 2011.

[153] F. Berman and R. Wolski, “The AppLeS Project: A Status Report,” 8th NEC Research Symposium, May 1997.

[154] P. Chandra, A. Fisher, C. Kosak, T. S. E. Ng, P. Steenkiste, E. Takahashi, and H. Zhang, “Darwin: Customizable Resource Management for Value-added Network Services,” 6th IEEE International Conference on Network Protocols, Oct. 1998.

[155] G. Allen, D. Angulo, I. Foster, G. Lanfermann, C. Liu, T. Radke, E. Seidel, and J. Shalf, “The Cactus Worm: Experiments with Dynamic Resource Discovery and Allocation in a Grid Environment,” International Journal of High Performance Computing Applications, Vol. 15, No. 4, 2001, pp. 345-358.

[156] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, D. Gannon, L. Johnsson, K. Kennedy, C. Kesselman, J. Mellor-Crummey, D. Reed, L. Torczon, and R. Wolski, “The GrADS Project: Software Support for High-level Grid Application Development,” International Journal of Supercomputer Applications, Vol. 15, No. 4, 2001, pp. 327-344.

[157] N. Kapadia, R. Figueiredo, and J. Fortes, “PUNCH: Web portal for running tools,” IEEE Micro, Vol. 20, No. 3, June 2000, pp. 38-47.

[158] D. Abramson, J. Giddy, and L. Kotler, “High Performance Parametric Modeling with Nimrod/G: Killer Application for the Global Grid?,” International Parallel and Distributed Processing Symposium (IPDPS 2000), May 2000, pp. 520-528.

[159] R. Buyya, D. Abramson, and J. Giddy, “Nimrod/G: An Architecture for a Resource Management and Scheduling System in a Global Computational Grid,” International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2000), May 2000, Vol. 1, pp. 283-289.

[160] R. Buyya, J. Giddy, and D. Abramson, “An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications,” 2nd International Workshop on Active Middleware Services (AMS’00), Aug. 2000.

[161] G. Valentini, W. Lassonde, S. Khan, N. Min-Allah, S. Madani, J. Li, L. Zhang, L. Wang, N. Ghani, J. Kolodziej, H. Li, A. Zomaya, C. Xu, P. Balaji, A. Vishnu, F. Pinel, J. Pecero, D. Kliazovich, and P. Bouvry, “An Overview of Energy Efficiency Techniques in Cluster Computing Systems,” Cluster Computing, pp. 1-13, Sep. 2011, DOI: 10.1007/s10586-011-0171-x.

[162] L. Kotler, D. Abramson, P. Roe, and D. Mather, “Activesheets: Super-computing with Spreadsheets,” Advanced Simulation Technologies Conference High Performance Computing Symposium (HPC’01), Apr. 2001.

[163] H. Casanova and J. Dongarra, “Netsolve: A Network-enabled Server for Solving Computational Science Problems,” International Journal of Supercomputer Applications and High Performance Computing, Vol. 11, No. 3, 1997, pp. 212-223.

[164] J. Gehring and A. Streit, “Robust Resource Management for Metacomputers,” 9th IEEE International Symposium on High Performance Distributed Computing, Aug. 2000.

[165] I. Foster and C. Kesselman, “Globus: A Metacomputing Infrastructure Toolkit,” International Journal of Supercomputer Applications, Vol. 11, No. 2, 1996, pp. 115-128.

[166] A. Grimshaw and W. Wulf, “The Legion Vision of a Worldwide Virtual Computer,” Communications of the ACM, Vol. 40, No. 1, Jan. 1997, pp. 39-45.

[167] Amazon Elastic Compute Cloud (EC2), http://www.amazon.com/ec2/, accessed Feb. 10, 2011.

[168] Z. Hill and M. Humphrey, “A Quantitative Analysis of High Performance Computing with Amazon’s EC2 Infrastructure: The Death of the Local Cluster?,” 10th IEEE/ACM International Conference on Grid Computing, Oct. 2009, pp. 26-33.

[169] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, “Xen and the Art of Virtualization,” ACM Symposium on Operating Systems Principles (SOSP), Vol. 37, No. 5, Oct. 2003, pp. 164-177.

[170] A. Bedra, “Getting Started with Google App Engine and Clojure,” IEEE Internet Computing, Vol. 14, No. 4, 2010, pp. 85-88.

[171] Google App Engine, http://appengine.google.com, accessed Feb. 02, 2011.

[172] I. Baldine, Y. Xin, A. Mandal, and C. Heermann, “Networked Cloud Orchestration: A GENI Perspective,” IEEE GLOBECOM Workshops, Dec. 2010, pp. 573-578.

[173] Slice Federation Architecture 2.0, GENI, http://groups.geni.net/geni/wiki/SliceFedArch, accessed Sep. 17, 2012.

[174] Sun Network.com (Sun Grid), http://www.network.com, accessed Feb. 08, 2011.

[175] W. Gentzsch, “Sun Grid Engine: Towards Creating a Compute Power Grid,” 1st IEEE/ACM International Symposium on Cluster Computing and the Grid, May 2001, pp. 35-36.

[176] L. Uden and E. Damiani, “The Future of E-learning: E-learning Ecosystem,” 1st IEEE International Conference on Digital Ecosystems and Technologies, June 2007, pp. 113-117.

[177] V. Chang and C. Guetl, “E-Learning Ecosystem (ELES) - A Holistic Approach for the Development of More Effective Learning Environment for Small-and-Medium Sized Enterprises,” 1st IEEE International Conference on Digital Ecosystems and Technologies, Feb. 2007, pp. 420-425.

[178] B. Dong, Q. Zheng, J. Yang, H. Li, and M. Qiao, “An E-learning Ecosystem Based on Cloud Computing Infrastructure,” 9th IEEE International Conference on Advanced Learning Technologies, July 2009, pp. 125-127.

[179] X. Chu, K. Nadiminti, C. Jin, S. Venugopal, and R. Buyya, “Aneka: Next-Generation Enterprise Grid Platform for e-Science and e-Business Applications,” 3rd IEEE International Conference on e-Science and Grid Computing, Dec. 2007, pp. 151-159.

[180] A. Chien, B. Calder, S. Elbert, and K. Bhatia, “Entropia: Architecture and Performance of an Enterprise Desktop Grid System,” Journal of Parallel and Distributed Computing, Vol. 63, No. 5, May 2003, pp. 597-610.

[181] OpenStack, http://openstack.org/downloads/openstack-overview-datasheet.pdf, accessed Apr. 04, 2012.

[182] E. Bini, G. C. Buttazzo, and G. Buttazzo, “Rate monotonic analysis: the hyperbolic bound,” IEEE Transactions on Computers, Vol. 52, No. 7, 2003, pp. 933-942.

[183] A. P. Chandrakasan and R. W. Brodersen, Low Power Design, Kluwer Academic Publishers, Dordrecht, 1995.

[184] Crusoe Processor Model TM5800 Specifications, http://www.charmed.com/PDF/TM5800.pdf, 2011.

[185] N. Min-Allah, Y. Wang, X. Jian-Sheng, and J. Liu, “Revisiting fixed priority techniques,” Proceedings of Embedded and Ubiquitous Computing (EUC07), LNCS, Vol. 4808, 2007, pp. 134-145.

[186] D. Grosu and A. Chronopoulos, “Noncooperative load balancing in distributed systems,” Journal of Parallel and Distributed Computing, Vol. 65, 2005, pp. 1022-1034.

[187] R. Mirchandaney, D. Towsley, and J. Stankovic, “Adaptive load sharing in heterogeneous systems,” Proceedings of the Ninth IEEE International Conference on Distributed Computing Systems, June 1989, pp. 298-306.

[188] M. H. Willebeek-LeMair and A. P. Reeves, “Strategies for dynamic load balancing on highly parallel computers,” IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 9, Sep. 1993, pp. 979-993.

[189] D. Grosu, A. T. Chronopoulos, and M. Y. Leung, “Load balancing in distributed systems: an approach using cooperative games,” Proceedings of the International Parallel and Distributed Processing Symposium, Apr. 2002, pp. 52-61.

[190] H. Kameda, J. Li, C. Kim, and Y. Zhang, “Optimal Load Balancing in Distributed Computer Systems,” Springer, London, 1997.

[191] Andrey G. et al., “Load balancing algorithms based on gradient methods and their analysis through algebraic graph theory,” Journal of Parallel and Distributed Computing, Vol. 68, 2008, pp. 209-220.

[192] F. Lin and R. Keller, “The gradient model load balancing method,” IEEE Transactions on Software Engineering, Vol. SE-13, No. 1, 1987, pp. 32-38.

[193] R. Lüling, B. Monien, and F. Ramme, “A study on load balancing algorithms,” Technical Report, Universität-GH Paderborn, 1992.

[194] F. Muniz and E. Zaluska, “Parallel load-balancing: an extension to the gradient model,” Parallel Computing, Vol. 21, 1995, pp. 287-301.

[195] Y. C. Lee and A. Y. Zomaya, “Energy efficient utilization of resources in cloud computing systems,” Journal of Supercomputing, Vol. 60, No. 2, 2012, pp. 268-280, DOI: 10.1007/s11227-010-0421-3.

[196] A. Hameed et al., “A survey and taxonomy on energy efficient resource allocation techniques for cloud computing systems,” Computing, Springer, June 2014, DOI: 10.1007/s00607-014-0407-8.

Author’s Publications

(Published)

1. N. Min-Allah, Hameed Hussain, S. U. Khan, and A. Y. Zomaya, “Power Efficient Rate Monotonic Scheduling for Multi-core Systems,” Journal of Parallel and Distributed Computing, vol. 72, no. 1, pp. 48-57, 2012. Elsevier Journal. Impact Factor: 1.078 (2010).

2. Hameed Hussain, S. U. R. Malik, A. Hameed, S. U. Khan, G. Bickler, N. Min-Allah, M. B. Qureshi, L. Zhang, W. Yongji, N. Ghani, J. Kolodziej, A. Y. Zomaya, C.-Z. Xu, P. Balaji, A. Vishnu, F. Pinel, J. E. Pecero, D. Kliazovich, P. Bouvry, H. Li, L. Wang, D. Chen, and A. Rayes, “A Survey on Resource Allocation in High Performance Distributed Computing Systems,” Parallel Computing, vol. 39, no. 11, pp. 709-736, 2013. Elsevier Journal. Impact Factor: 1.214 (2013).

3. Hameed Hussain et al., “Load Balancing through Task Shifting and Task Splitting Strategies in Multi-core Environment,” Journal of Electronic Systems, vol. 4, no. 2, June 2014, pp. 61-67.

4. Hameed Hussain, Muhammad Bilal Qureshi, Manzoor Illahi Tamimy, “Minimizing Power Consumption through System Speed using Genetic Algorithm,” accepted in The Scientific World Journal (TSWJ), Hindawi, 2016.

5. Muhammad Zakarya, Syed Bilal Hussain Shah, Aftab Alam, Ateeq ur Rahman, Arsh ur Rahman, Izaz ur Rahman, Ayaz Ali Khan, Hameed Hussain, Nazar Abbas, “An Overview of New Ultra Lightweight RFID Authentication Protocol SASI,” International Journal of Computer Science Issues (IJCSI), vol. 8, issue 2, pp. 518-524, March 2011, USA, ISSN (Online): 1694-0814.

6. Muhammad Bilal Qureshi, Maryam Mehri Dehnavi, Nasro Min-Allah, Muhammad Shuaib Qureshi, Hameed Hussain, Ilias Rentifis, Nikos Tziritas et al., “Survey on Grid Resource Allocation Mechanisms,” Journal of Grid Computing (2014): 1-43. Springer. Impact Factor: 1.667 (2013).

7. Hameed Hussain, Maqbool Uddin Shaikh, Saif Ur Rehman Malik, “Proposed Text Mining Framework to Explore Issues from Text in a Certain Domain,” IEEE International Conference on Computer Engineering and Applications (ICCEA), 19-21 March 2010, pp. 16-21, Bali Island, Indonesia, DOI: 10.1109/ICCEA.2010.11.

8. Hameed Hussain, Muhammad Bilal Qureshi, Muhammad Shoaib, Sadiq Shah, “Load Balancing through Task Shifting and Task Splitting Strategies in Multi-core Environment,” 8th IEEE International Conference on Digital Information Management (ICDIM), 10-12 September 2013, pp. 385-390, Marriott, Islamabad, Pakistan. DOI: 10.1109/ICDIM.2013.6694040.

9. Azra Shamim, Hameed Hussain, Maqbool Uddin Shaikh, “A Framework for Generation of Rules from Decision Tree and Decision Table,” IEEE International Conference on Education and Information Technology (ICEIT), 17-19 September 2010, pp. 1-6, FAST University, Karachi, Pakistan. DOI: 10.1109/ICIET.2010.5625700.

10. Sadiq Shah, Hameed Hussain, Muhammad Shoaib, “Minimizing Non-coordinated Interference in Multi-Radio Multi-Channel Wireless Mesh Networks (MRMC-WMNs),” 8th IEEE International Conference on Digital Information Management (ICDIM), 10-12 September 2013, pp. 24-28, Marriott, Islamabad, Pakistan. DOI: 10.1109/ICDIM.2013.6694017.

11. Muhammad Shoaib, Nasru Minallah, Shahzad Rizwan, Sadiq Shah, Hameed Hussain, “Investigating the Impact of Group Mobility Models over the On-Demand Routing Protocol in MANETs,” 8th IEEE International Conference on Digital Information Management (ICDIM), 10-12 September 2013, pp. 29-34, Marriott, Islamabad, Pakistan. DOI: 10.1109/ICDIM.2013.6694016.