Performance and Power optimization of heterogeneous multi-cores
A Thesis Proposal
Submitted to
WAran Research FoundaTion (WARFT)
by
Vignesh Adhinarayanan
Senior Research Trainee
In Partial Fulfillment of the Requirements for
Research Awareness Program and Training (RAPT)
at
Waran Research Foundation
Vishwakarma - High Performance Computing Group
WAran Research FoundaTion (WARFT)
Chennai, India
URL: http://www.warftindia.org
email: [email protected]
Phone: +91-44-24899766
December 2010
Performance and Power optimization of heterogeneous multi-cores
Approved by:
Prof. N. Venkateswaran, Advisor
Founder-Director, WARFT
Date Approved:
To
Prof. Waran, my Guru
ABSTRACT
Conventional processor design methodologies, which exhaustively evaluate every
possible configuration of an architecture, are not suitable for heterogeneous
architectures because of the large number of parameters involved. When these
architectures are designed to suit a wide class of applications, their optimization
becomes very important, unlike the case of general-purpose processors.

An optimization methodology is proposed here that effectively searches the design
space in order to find architectures satisfying the power, performance, or
performance-to-power requirements of these multiple applications. The methodology
employs Game Theory, in view of the large number of architecture parameters, to
select a subset of parameters to be perturbed at different stages of the
optimization process; Simulated Annealing to find architectures that match the
input specifications; and graph partitioning to form cores in which highly
communicating functional units are grouped together in order to minimize
communication latency. Simulated annealing with multiple temperatures, which is
expected to give better results considering the large design space of
heterogeneous architectures, is yet to be implemented to optimize the architecture.
TABLE OF CONTENTS

DEDICATION
ABSTRACT
LIST OF FIGURES

I ORIGIN AND HISTORY
  1.1 Introduction
  1.2 Homogeneous multi-core processors
  1.3 Heterogeneous multi-core processors
  1.4 Custom built heterogeneous multi-core architectures (CUBEMACH)
  1.5 The need for optimization of custom-built heterogeneous multi-cores
  1.6 Overview of Thesis

II PROPOSED RESEARCH
  2.1 Performance modelling of heterogeneous multi-core architectures
  2.2 Power modelling of heterogeneous multi-core architectures
  2.3 Custom Built Heterogeneous Multi-Core Architecture (CUBEMACH) Optimization
    2.3.1 Application of Game Theory to choose the parameters to be perturbed
    2.3.2 Application of simulated annealing for choosing the optimal solution
    2.3.3 Application of graph partitioning algorithm for heterogeneous core formation
    2.3.4 Application of multiple temperature simulated annealing for the optimization process

III WORK COMPLETED
  3.1 Performance modelling of heterogeneous multi-cores
    3.1.1 Analytical model of a heterogeneous core
    3.1.2 Analytical model of Compiler on Silicon (COS)
    3.1.3 Analytical model of On Node Network (ONNET)
    3.1.4 Analytical model for bandwidth of memory for heterogeneous multi-cores
  3.2 Power modelling of Algorithm Level Functional Units
  3.3 Optimization of heterogeneous multi-core architectures using single temperature simulated annealing
    3.3.1 Experimental methodology
    3.3.2 Results for single temperature simulated annealing process

IV WORK TO BE COMPLETED
  4.1 Power model of CUBEMACH based architectures
  4.2 Optimization of heterogeneous multi-core architectures using multiple temperature simulated annealing
LIST OF FIGURES

2.1 Optimization Flow
2.2 General Simulated Annealing
2.3 Multiple Objective Simulated Annealing [28]
2.4 Modelling Communication
3.1 Performance of the architecture over the iterations of the optimization process - Matrix Based Algorithms
3.2 Rate of scheduling for every 1000 clock cycles for initial architecture and final architecture - Matrix based Algorithms
3.3 Performance of the architecture over the iterations of the optimization process - Graph Based Algorithms
3.4 Rate of scheduling for every 1000 clock cycles for initial architecture and final architecture - Graph based Algorithms
3.5 Number of units in the initial and optimized architecture - Matrix and Graph based algorithms
3.6 Impact of core formation - Amount of data transferred before and after core formation by application of graph partitioning
4.1 Multiple Temperature Simulated Annealing
CHAPTER I
ORIGIN AND HISTORY
1.1 Introduction
The computational demand of applications, such as brain modelling, weather mod-
elling, protein folding, astrophysics simulation etc., has been continuously increasing
over time. This demand led to the development of processors capable of delivering
higher performance. Initially, in order to improve the performance of processors, the
clock frequency of the processors was increased to a point supported by improve-
ment in technology. However, this resulted in increased power dissipation and design
complexity, as well as reduced reliability [8]. As a result, the trend of increasing the
clock frequency saturated once clock frequencies of 4 GHz were reached (Intel
cancelled a 4 GHz processor) [9]. In order to meet the increased demands of
applications, multi-core processors were introduced [9], in which higher performance
can be extracted without increasing the clock frequency or the design complexity.
1.2 Homogeneous multi-core processors
The conventional approach to multi-core design involves replicating the same core
many times in a single processor. These cores have the same kinds of units and the
same number of units of each kind. They operate at the same clock frequency, are
capable of delivering the same performance, and are optimized to consume minimal
power during execution. In the case of such homogeneous multi-cores, an increase
in performance is achieved by extracting parallelism from applications and
executing the independent parts of the applications on the multiple cores
concurrently. The performance achieved by these processors increases when multiple
applications are run on them as more parallelism can be extracted from multiple ap-
plications.
Homogeneous multi-cores have been widely used. One example is the Pentium
Dual Core processor [10], in which two identical cores make up the processor;
Intel's approach to multi-core processors has been to replicate the same core
many times in order to improve performance, with each core optimized for high
performance and low power dissipation. Another example is Sun Microsystems's
UltraSPARC processor [12], which is available with four, six, or eight identical
cores. The UltraSPARC processor can run many threads on a single core, and when a
long-latency event such as a cache miss occurs, another thread is brought in for
execution [11]. Other examples of homogeneous multi-core processors include
Intel's Core i7 [14], AMD's Quad-Core Opteron [15], IBM's POWER [13], etc.
One of the main advantages of resorting to homogeneous multi-core processors
is that the design turnaround time is very short. Since each core in a
processor is identical, only one of them needs to be optimized for the required
performance and power; it can then be replicated, which does not take much time.
Also, it is sufficient to test and validate just one core, which again helps
bring down the design turnaround time. Another important advantage is that the
cost involved in implementing the processors is reduced: since the cores are
identical in all aspects, the masks used in fabrication can be reused for all
the cores. Thus the use of homogeneous cores brings down the turnaround time and
reduces manufacturing costs.
1.3 Heterogeneous multi-core processors
As multiple applications are being run on multi-core processors in order to im-
prove their performance, the usage of homogeneous multi-cores would mean that each
of these cores should be capable of executing all the applications. These applications
can have diverse needs: one of them may be predominantly based on matrix oper-
ations; another can be based on graph theoretic operations and so on. When using
homogeneous cores, each core should have sufficient resources to execute all
these applications. This results in over-provisioning of resources and,
consequently, resource wastage, since all the cores will contain all kinds of
units while only some of these units are used, depending on the application
mapped onto the core.
Some examples of heterogeneous multi-cores include IBM's Cell Broadband
Engine [16] [17]. It contains a main processor, called the Power Processing Element
(PPE), eight co-processors, called Synergistic Processing Elements (SPEs), and
the Element Interconnect Bus (EIB) [17] [18]. The Instruction Set Architectures of
the PPE and the SPEs are different, and the two have different functionalities.
Another example of a heterogeneous multi-core is the GPU: the nVidia Fermi
architecture [19] contains four specialised cores with units for sine, cosine,
reciprocal, and square-root operations, while all the other cores are identical
in nature.
Heterogeneous cores have their disadvantages: fabrication cost increases
because each core is different in nature, and design turnaround time increases
because each core has to be designed separately, optimized for the necessary
performance and power requirements, and then tested and verified to ensure its
proper functionality. However, the implementation cost can be brought down by
resorting to a cell-based approach in which a limited set of cells is used.
Another way of reducing the cost involved in designing heterogeneous
architectures is to use SCOC IP Cores (SuperComputer On Chip IP Cores) [21]. The
advantages offered by heterogeneous multi-cores include better resource
utilization [22], as applications or parts of applications can be mapped onto
the cores which best fit them, making efficient use of the available resources.
Better resource utilization also means that the power consumed in executing the
applications can be minimized,
as fewer units are idle, thereby consuming less static power.
1.4 Custom built heterogeneous multi-core architectures (CUBEMACH)
When heterogeneous cores are used, the design of the architecture should
reflect the needs of the multiple applications run on it concurrently. The
CUBEMACH (Custom Built Heterogeneous Multi-core Architectures) design paradigm
offers the possibility of increased resource utilization by running multiple
applications simultaneously [23] without space-time sharing. Architectures
based on the CUBEMACH design paradigm [24] are designed to suit the multiple
applications that are to be run on them. These architectures employ a variety of
Algorithm Level Functional Units (ALFUs) [25], apart from scalar units, in order
to meet the performance requirements of the applications. Examples of ALFUs
include 2X2 matrix multiplication units, 2X2 Crout's units, graph traversal
units, sorter units, etc. These Algorithm Level Functional Units are driven by
algorithm-level instructions, which constitute the ISA for architectures based
on the CUBEMACH design paradigm; this ISA is called the Algorithm Level
Instruction Set Architecture (ALISA) [25]. Architectures designed on the
CUBEMACH paradigm use a cell-based approach implemented on SCOC IP Cores to
minimize cost [21]. The On-Node-Network (ONNET) [25], based on a Multistage
Interconnection Network (MIN), takes care of communication both within and
across the cores. The Compiler-On-Silicon (COS) [25] takes care of compilation
and scheduling; a hardware-based approach is used to meet the high performance
requirements.
1.5 The need for optimization of custom-built heterogeneous multi-cores
Conventional processor design methodologies involve exhaustive design space
exploration, in which all possible processors that could be built from the design
space are considered and simulated in order to study how they perform for
different sets of inputs [4]. From this set of all possible processors, the
single processor that comes closest to satisfying the requirements of the user is
selected on the basis of the simulation results and is implemented. This approach
to designing processors posed no problems until heterogeneous multi-cores came
into being [19] [16] to cater to the diverse computational needs and power
requirements of the Grand Challenge Applications. The usage of heterogeneous
multi-cores made the conventional processor design methodology obsolete, as the
number of architectural parameters to be considered becomes very high, making the
design space larger. This means that the number of possible processors from the
design space increases exponentially [5], and as a result the time taken to
explore the design space exhaustively also becomes very high. Hence arises the
need for effective design space exploration in order to prune the search space.
The need for optimization in general-purpose processors is comparatively
lower, as these processors are not specialized for any particular application or
class of applications. In application-specific processors, it becomes important
to tune the processor to suit the particular application it runs; optimizing the
processor for that application becomes critical. CUBEMACH-based processors, which
are not general-purpose processors, favour running a wider class of applications
on the same node without time or space sharing. These applications can have
different requirements (performance, power, performance-to-power ratio, etc.),
and the architecture design has to reflect the requirements of these individual
applications. If these applications belong to different users, then optimization should
be carried out to satisfy the requirements of the different users; otherwise, if
the applications belong to a single user, the optimization has to be carried out
to satisfy the requirements of the different applications. Hence arises the need
to optimize the architecture to suit the requirements of the multiple applications.
1.6 Overview of Thesis
The optimization methodology proposed here explores a pruned design space in
order to find an architecture that meets the performance, power, or
performance-to-power ratio requirements of the multiple applications to be run
on an architecture designed with the CUBEMACH design paradigm. Game Theory [3]
is employed by the optimization process in view of the vast number of intricately
related architecture parameters: in every iteration, it selects the subset of
architecture parameters whose values are changed. Simulated annealing [1] is
employed to explore the search space in order to achieve the multiple objectives,
which are the individual applications' requirements. The optimization technique
makes use of the Kernighan-Lin algorithm [2] for core formation, grouping ALFUs
into populations [24] and populations into cores in order to minimize
communication across populations and cores, which reduces the execution time of
an application.
CHAPTER II
PROPOSED RESEARCH
2.1 Performance modelling of heterogeneous multi-core architectures
An analytical model of a system is used to predict the performance of the system
for given applications. The system is represented in terms of its various
architecture parameters; the analytical model of the system is the set of
equations, whose variables are the architecture parameters, that describes the
performance of the system. This set of equations thus captures the relationship
among the various parameters of the system and can be used to predict the
performance of the system under ideal conditions. We form equations relating the
various architecture parameters to each other after analyzing the CUBEMACH-based
architecture.
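The idea can be illustrated with a toy analytical model. The following sketch is purely illustrative: the parameter names (num_alfus, sched_rate, onnet_latency, comm_fraction) and the form of the equation are assumptions made for exposition, not the actual CUBEMACH model.

```python
def predicted_cycles(num_instructions, num_alfus, sched_rate,
                     onnet_latency, comm_fraction):
    """Estimate execution cycles under ideal conditions (toy model)."""
    # Compute is limited either by the ALFU count or by the scheduler issue rate.
    issue_limit = num_instructions / min(num_alfus, sched_rate)
    # Communication adds network latency for the fraction of results moved.
    comm_cycles = num_instructions * comm_fraction * onnet_latency
    return issue_limit + comm_cycles

cycles = predicted_cycles(num_instructions=10_000, num_alfus=8,
                          sched_rate=4, onnet_latency=12, comm_fraction=0.2)
```

A model of this shape makes parameter interactions explicit: here, adding ALFUs beyond the scheduler's issue rate yields no predicted gain, a dependency the optimization process must respect.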
2.2 Power modelling of heterogeneous multi-core architectures
Architectures based on the CUBEMACH design paradigm are built from MIP cells,
which are the basic building blocks. Power modelling of CUBEMACH-based
architectures will involve modelling the power of the various units in the
architecture that are built from MIP cells, modelling the power consumed by
memory, and modelling the power consumed by interconnects and bus lines. Cells
are logically equivalent to CMOS complex logic gates and their variants. Hence,
it is sufficient to obtain models of the dynamic, static, and short-circuit power
for an individual MIP cell [4] and then extrapolate these models to estimate the
total power consumed by the architecture.
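As an illustration of the extrapolation step, the following sketch scales assumed per-cell dynamic, static, and short-circuit power figures by cell count and switching activity. All names and numbers are hypothetical, not measured MIP cell data.

```python
def unit_power(n_cells, activity, p_dynamic_cell, p_static_cell, p_short_cell):
    """Power of one unit built from n_cells MIP cells: dynamic and
    short-circuit power scale with switching activity, while static
    (leakage) power is paid by every cell regardless of activity."""
    return n_cells * (activity * (p_dynamic_cell + p_short_cell) + p_static_cell)

def architecture_power(units):
    """Extrapolate: total architecture power is the sum over all units."""
    return sum(unit_power(**u) for u in units)

total = architecture_power([
    {"n_cells": 5000, "activity": 0.3, "p_dynamic_cell": 2e-6,
     "p_static_cell": 1e-7, "p_short_cell": 2e-7},
    {"n_cells": 2000, "activity": 0.5, "p_dynamic_cell": 2e-6,
     "p_static_cell": 1e-7, "p_short_cell": 2e-7},
])
```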
2.3 Custom Built Heterogeneous Multi-Core Architecture (CUBEMACH) Optimization
The optimization engine for CUBEMACH is shown in Figure 2.1. The requirements
of the different applications, belonging to the same or different users, are
given as input to the optimization engine. Values of the various architectural
parameters are assigned based on the analytical model of the architecture and
the applications to be run. The Game Theory engine selects a subset of
parameters from the set of all architecture parameters; this subset is selected
based on heuristics, so that only those parameters expected to significantly
affect the result are chosen. Simulated annealing is employed as the global
optimization process, and the values of the selected parameters are varied. The
new architecture so derived is simulated, and from the results the communication
pattern is obtained and modelled as a graph. The KL graph partitioning algorithm
is employed to group the highly communicating ALFUs into a single population.
The resulting architecture is simulated again and the performance-to-power ratio
obtainable from it is calculated. This value is compared with the required
performance-to-power ratio; when the input requirements are met, the
optimization process stops and the solution is obtained.
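The overall flow can be sketched as the loop below. The stand-in functions are toy stubs for the Game Theory engine, the annealing move, the simulator, and KL core formation respectively; their names, signatures, and the toy metric are assumptions, not the actual implementations.

```python
import random
from dataclasses import dataclass

@dataclass
class Result:
    comm_graph: dict
    perf_to_power: float

def select_parameters(arch):
    # Stand-in for the Game Theory engine: pick one parameter at random.
    return random.sample(list(arch), k=1)

def perturb(arch, subset):
    # Stand-in for the simulated-annealing move: nudge each chosen parameter.
    new = dict(arch)
    for p in subset:
        new[p] = max(1, new[p] + random.choice((-1, 1)))
    return new

def simulate(arch):
    # Stand-in for the simulator: toy metric where extra ALFUs help only
    # until the scheduler's issue rate becomes the bottleneck.
    perf = min(arch["num_alfus"], arch["sched_rate"])
    power = 0.1 * sum(arch.values())
    return Result(comm_graph={}, perf_to_power=perf / power)

def form_cores(arch, comm_graph):
    # Stand-in for KL-based core formation (a no-op in this sketch).
    return arch

def optimize(arch, required_ppw, max_iters=1000):
    """Loop until the required performance-to-power ratio is met."""
    for _ in range(max_iters):
        candidate = perturb(arch, select_parameters(arch))
        candidate = form_cores(candidate, simulate(candidate).comm_graph)
        if simulate(candidate).perf_to_power >= required_ppw:
            return candidate
        arch = candidate
    return arch

random.seed(1)
final = optimize({"num_alfus": 2, "sched_rate": 2}, required_ppw=1.0)
```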
2.3.1 Application of Game Theory to choose the parameters to be perturbed
The implementation of Game Theory is tightly integrated with the implementation
of simulated annealing. Game Theory is employed to select the subset of
architecture parameters whose values are changed in every iteration of the
simulated annealing process to determine the neighbour state, as explained in
the next section.
[Figure 2.1: Optimization Flow - architecture parameters and the desired
performance feed a loop of Game Theory, Simulated Annealing, simulation, core
formation, and a second simulation, whose obtained performance is fed back until
a near-optimal solution is reached]
The objective of the game is to reach an architecture state in which all the
user requirements (performance, power, performance-to-power ratio) for the
individual applications are satisfied. The players of the game are the various
architecture parameters. This is a case of cooperative Game Theory, in which the
players of the game work together to achieve a common goal: reaching an
architecture state where the requirements (performance, power, etc.) are met.
The parameters of CUBEMACH are intricately interrelated. Selecting one
parameter alone might not show an improvement in performance unless its related
parameters are also selected and their values modified. For example, increasing
the number of ALFUs does not improve performance unless the scheduler is also
made powerful enough to deliver instructions at a rate that caters to the
increase in ALFUs. Also, when one player (an architecture parameter) is selected
to play the game (achieving the multiple objectives)
in that iteration, there may be another player (parameter) interacting with the
first-mentioned player which could give better results. The Game Theory engine
ensures that an appropriate subset of the parameters is chosen that helps in
quickly achieving the goal of the game. The Game Theory engine also has an
arbitrator, which ensures that the amount by which a parameter's value is
changed stays within acceptable limits, so that an unrealizable architecture
does not result at the end of the optimization process. Another purpose of the
arbitrator is to ensure that the values of the parameters do not conflict with
each other.
The implementation is as follows. Every parameter (a player in the game) is
initially assigned some rank. The Game Theory engine selects a subset of the
parameters, based on their ranks and their dependencies, whose values are
changed by some acceptable amount (ensured by the arbitrator). The resulting
architecture is simulated. If there is an appreciable improvement in performance
compared to the previous architecture state, then the selected subset of
parameters has contributed positively and their ranks are increased; if there is
not much change in performance, their ranks are decreased. In every iteration,
the Game Theory engine selects parameters with higher ranks, which are expected
to significantly affect the performance of the architecture. When the subset of
parameters is selected, those parameters closely related to them are also
included in the subset. This is achieved using an analytical model of the
CUBEMACH architecture, which captures the dependencies across parameters. A
variation of this implementation assigns an initial rank to the various teams of
the game. Each team is a subset of the entire parameter set and contains
parameters closely related to each other; the parameters within a team are also
assigned ranks. In each iteration, some teams are selected to play the game, and
from the selected teams, a subset
of players is selected every iteration based on the degree of closeness across
parameters. The rank of a player within a team shows how much that player
affects the optimization process within the team. So, in every iteration, the
Game Theory engine selects the teams that best affect the optimization process,
and from those teams, closely related players are selected whose values are
changed to determine the neighbour state of simulated annealing, as explained in
the next section. Thus the Game Theory engine within the optimization engine
ensures that only the values of those parameters expected to significantly
affect the performance are changed.
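The rank-based selection described above can be sketched as follows. The parameter names, the dependency map, and the rank-update rule are illustrative assumptions, not the actual CUBEMACH parameter set or heuristics.

```python
import random

# Illustrative rank table and dependency map (assumed parameter names).
ranks = {"num_alfus": 1.0, "sched_rate": 1.0, "onnet_width": 1.0}
related = {"num_alfus": ["sched_rate"], "sched_rate": [], "onnet_width": []}

def select_subset(k=1):
    # Prefer higher-ranked parameters via rank-weighted random choice ...
    chosen = set(random.choices(list(ranks), weights=list(ranks.values()), k=k))
    # ... then pull in closely related parameters (from the analytical model).
    for p in list(chosen):
        chosen.update(related[p])
    return chosen

def update_ranks(subset, old_perf, new_perf, step=0.1):
    # Reward subsets that improved performance; penalise those that did not.
    for p in subset:
        if new_perf > old_perf:
            ranks[p] += step
        else:
            ranks[p] = max(step, ranks[p] - step)

random.seed(0)
subset = select_subset()
update_ranks(subset, old_perf=1.0, new_perf=1.2)
```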
2.3.2 Application of simulated annealing for choosing the optimal solution
Simulated annealing derives its name from the metallurgical process of
annealing, in which a metal is heated to a high temperature and cooled very
slowly in order to remove its defects. The heat causes the metal's ions to move
freely to states with higher energy, and the slow cooling lets them settle into
configurations with lower energy than the initial state; i.e., the structure of
the metal is reformed, first by heating it to a very high temperature and then
by cooling it slowly. Similarly, in simulated annealing we start at a high
temperature T, where even poor solutions (architecture states far from the
required solution) are often accepted, and as the temperature is slowly reduced,
increasingly only those solutions which are better than the state from the
previous iteration are accepted.
An architecture state assigns a value from the design space to every
architectural parameter. Every state has an energy Estate associated with it.
This energy is the value of an objective function, where the objective can be to
minimize power, maximize performance, or maximize the performance-to-power
ratio; the energy is defined such that minimizing it achieves the objective.
Here, we optimize the architecture with respect to the performance-to-power
ratio, and
hence the energy Estate is a measure of the performance-to-power ratio. In every
iteration, the values of certain parameters from the previous iteration are
changed, resulting in a neighbour state with an associated energy Estate+1. The
parameters whose values are to be changed are selected by applying Game Theory,
as discussed in the previous section. ∆E = Estate+1 - Estate is the energy
difference between the two states. If ∆E is less than 0, we have a better
solution and the neighbour state is accepted. However, a worse solution is also
accepted sometimes, in order to ensure that the process does not get stuck in a
local minimum. To determine which worse solutions can be accepted, we calculate
two quantities P1 and P2 as shown below.
P1 = e^(−∆E/kT)
P2 = random(0, 1)

where k is the Boltzmann constant and T is the temperature of the process.
When P1 > P2, the worse solution is accepted. As T is slowly reduced over
time, the value of ∆E/kT generally increases and hence the value of P1
decreases; consequently, only very few poor solutions are accepted in the later
stages of the simulated annealing process.
The rate at which T is reduced is important; it is determined by a process
known as temperature scheduling, on which the success or failure of the
simulated annealing process very much depends. If the temperature is reduced
from an initially very high value Tstart to 0 K over infinite iterations, it is
assured that a global minimum is reached; proofs can be found in [7] and [6].
However, it is not practical to have infinite iterations.
If the cooling takes place at a fast rate, there is no guarantee for the process
to converge to a global minimum, and the process becomes simulated quenching.
The temperature has to be lowered at a rate such that convergence to a global
minimum is still possible. The simulated annealing process is continued until
the user requirements are met. If an unreasonably high goal (in this case, a
performance-to-power ratio) is set which cannot be achieved, the process
continues until a global minimum is reached or, in other words, until the
solution which comes closest to meeting the user's objectives and requirements
is reached.
S ← Initial State S0
T ← Initial Temperature T0
while (termination criterion is not met) and (T != Tend) do
    S' ← Neighbour of S
    ∆E ← E(S') − E(S)
    if ∆E < 0
        S ← S'
    else
        P1 ← e^(−∆E/kT)
        P2 ← random(0, 1)
        if P1 > P2
            S ← S'
    T ← UpdateTemperature()
end
Output S

Figure 2.2: General Simulated Annealing
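An executable rendering of this scheme is sketched below, with a toy one-dimensional energy function standing in for the simulated architecture and the Boltzmann constant folded into the temperature.

```python
import math
import random

def simulated_annealing(energy, neighbour, s0, t0, t_end=1e-3, alpha=0.95):
    """Minimise `energy` following the scheme of Figure 2.2."""
    s, t = s0, t0
    while t > t_end:
        s_new = neighbour(s)
        delta_e = energy(s_new) - energy(s)
        if delta_e < 0:
            s = s_new                        # better solution: always accept
        elif math.exp(-delta_e / t) > random.random():
            s = s_new                        # occasionally accept a worse one
        t *= alpha                           # geometric temperature schedule
    return s

random.seed(0)
best = simulated_annealing(
    energy=lambda x: (x - 3.0) ** 2,         # toy objective, minimum at x = 3
    neighbour=lambda x: x + random.uniform(-0.5, 0.5),
    s0=0.0, t0=10.0)
```

The geometric schedule here is one common choice of temperature scheduling; as the text notes, too large a reduction per step turns the process into simulated quenching.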
Another slight variation can be used when the CUBEMACH architecture needs
to run multiple independent applications, belonging to a single user or to many
different users, without space sharing or time sharing. These applications can
have different requirements and different priorities. For example, application 1
may have a high performance requirement, while application 2 may be required to
consume as little power as possible during execution, even if it takes a long
time to finish. In this case the architecture has to reflect the requirements of
the multiple applications, so while calculating the
value of ∆E, we consider the energy difference for each application separately:
the energy difference for application 1 is based on performance, the energy
difference for application 2 is based on power, and so on. The weighted sum of
these per-application differences is taken as the value of ∆E. The weights
depend on the priorities of the applications, with higher weights applied to
applications with higher priorities. The priority of an application, in turn,
can be determined by the cost (money) the user shares, i.e., the fraction of the
money spent on building the whole CUBEMACH architecture that the user
contributes. This way the architecture favours applications with higher
priority, while still satisfying the individual requirements of the multiple
applications.
S ← Initial State S0
T ← Initial Temperature T0
while (termination criterion is not met) and (T != Tend) do
    S' ← Neighbour of S
    ∆E1 ← E1(S') − E1(S)
    ∆E2 ← E2(S') − E2(S)
    ...
    ∆En ← En(S') − En(S)
    ∆E ← w1·∆E1 + w2·∆E2 + … + wn·∆En
    if ∆E < 0
        S ← S'
    else
        P1 ← e^(−∆E/kT)
        P2 ← random(0, 1)
        if P1 > P2
            S ← S'
    T ← UpdateTemperature()
end
Output S

Figure 2.3: Multiple Objective Simulated Annealing [28]
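The weighted combination of per-application energy differences can be sketched as follows; the delta values and weights are illustrative numbers, not measured results.

```python
def weighted_delta(deltas, weights):
    """Combine per-application energy differences into a single ∆E."""
    assert len(deltas) == len(weights)
    return sum(w * d for w, d in zip(weights, deltas))

# Application 1 (higher priority, performance objective) improved by 3.0;
# application 2 (power objective) worsened slightly. The move is still
# accepted because the weighted sum favours the higher-priority application.
d_e = weighted_delta(deltas=[-3.0, 0.5], weights=[0.7, 0.3])
accept = d_e < 0
```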
2.3.3 Application of graph partitioning algorithm for heterogeneous core formation
The structure of ONNET (the communication backbone of CUBEMACH) is such
that the latency involved in transferring data from an ALFU in one population to
an ALFU in another population is greater than the latency involved in transfer of
data between ALFUs belonging to the same population. Therefore, if highly com-
municating ALFUs are in the same population, then the latency involved in transfer
of data from one ALFU to another is greatly reduced. Transfer of intermediate data
from one ALFU to another takes place often in higher end applications. So suitable
grouping of ALFUs greatly improves the performance of the system.
[Figure 2.4: Modelling Communication - nodes represent ALFUs (MATMUL, MATADD,
MATMUL, LUD) and edge weights give the bytes transferred between them (1.4 KB
and 0.3 KB in the example)]
The amount of data transferred between every pair of ALFUs, in bytes, is
available from the simulation results and can be used to group highly
communicating ALFUs suitably. The simulation results are used to construct a
graph-theoretic model of the communication pattern: the nodes of the graph are
the individual ALFUs in the architecture; an edge indicates that the two nodes
it connects communicate by transferring intermediate data; and the edge weight
is the amount of data transferred, in bytes, between the two nodes. The
Kernighan-Lin graph partitioning algorithm can be applied to this graph to form
sub-graphs whose external cost is minimal. Each sub-graph represents a
population; the internal cost (the sum of edge weights for edges within the
sub-graph) represents the amount of data transferred within the population, and
the external cost is the amount of data transferred across populations. By
minimizing the external
cost, we limit the number of bytes transferred across populations, thereby improving efficiency. Similarly, each population can itself be modelled as a node, with edges representing data transfer between populations and edge weights giving the amount of data transferred. The KL partitioning algorithm can then be applied to this second graph to group populations into cores.
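The external-cost computation and a simplified, greedy variant of the Kernighan-Lin swap can be sketched as follows. This is not the full KL algorithm (it omits KL's tentative swap sequences and gain bookkeeping) and the node names in the example are hypothetical, in the spirit of Figure 2.4.

```python
import itertools

def external_cost(edges, part_a):
    """KL external cost: total weight of the edges crossing the partition."""
    return sum(w for (u, v), w in edges.items()
               if (u in part_a) != (v in part_a))

def kl_bisect(nodes, edges, max_passes=10):
    """Greedy KL-style bisection: repeatedly apply the single swap that most
    reduces the external cost. A simplification of Kernighan-Lin, for
    illustration only."""
    half = len(nodes) // 2
    part_a, part_b = set(nodes[:half]), set(nodes[half:])
    for _ in range(max_passes):
        current = external_cost(edges, part_a)
        best_gain, best_pair = 0.0, None
        for u, v in itertools.product(part_a, part_b):
            trial = (part_a - {u}) | {v}          # swap u out, v in
            gain = current - external_cost(edges, trial)
            if gain > best_gain:
                best_gain, best_pair = gain, (u, v)
        if best_pair is None:
            break                                  # no swap improves the cut
        u, v = best_pair
        part_a = (part_a - {u}) | {v}
        part_b = (part_b - {v}) | {u}
    return part_a, part_b
```

On a hypothetical four-ALFU graph with two heavy 1.4 KB edges and two light 0.3 KB edges, a single swap pass groups the heavily communicating pairs and reduces the cut from 2.8 KB to 0.6 KB.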
2.3.4 Application of multiple temperature simulated annealing for the optimization process
Another approach is to use multiple temperature simulated annealing in place of single temperature simulated annealing. In this approach, we do not use the same acceptance probabilities for all the parameters. This work is to be taken up next; using multiple temperatures is expected to handle the large number of parameters of heterogeneous multi-cores better.
During the optimization process, some parameters may reach their optimal values much earlier than others. That is, some parameters are explored more than the rest: suboptimal values of these parameters have already been tried and accepted or rejected by the optimization technique, so at later iterations their final values are approximately known and many poor solutions involving them are rejected. However, some other parameters may not have been explored sufficiently in the initial stages. If the values of these parameters are changed at later stages of the optimization process, when the temperature is very low, any poor solution that results is immediately rejected. These parameters are therefore never sufficiently explored and their optimal values are never reached. This situation occurs frequently when the number of parameters is very high.
The approach proposed here is to use a different simulated annealing temperature for each team of parameters. Closely related parameters are grouped together as a team, and the acceptance probability for each team is computed from its own temperature. Even if some parameters are not selected early in the optimization process, their temperatures do not decrease while their teams remain unselected. This ensures that parameters selected later are also sufficiently explored.
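A minimal sketch of per-team temperatures, assuming three hypothetical teams and a geometric cooling factor (the team names, starting temperature, and cooling constant are all illustrative assumptions):

```python
import math
import random

# Hypothetical parameter teams of a CUBEMACH-style design space; each team
# keeps its own temperature and cools only when one of its members is perturbed.
temps = {"alfu_mix": 100.0, "scheduler": 100.0, "onnet": 100.0}
COOLING = 0.9  # assumed geometric cooling factor per perturbation

def accept(team, delta_e, k=1.0):
    """Metropolis acceptance test using the perturbed team's own temperature."""
    t = temps[team]
    temps[team] *= COOLING        # only the selected team's temperature drops
    if delta_e < 0:
        return True               # improvements are always accepted
    return random.random() < math.exp(-delta_e / (k * t))
```

Because an unperturbed team's temperature never changes, a parameter first touched late in the run is still evaluated at a high temperature and can be explored freely.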
CHAPTER III
WORK COMPLETED
3.1 Performance modelling of heterogeneous multi-cores
The analytical model for a CUBEMACH-based architecture [24] is as follows:
3.1.1 Analytical model of a heterogeneous core
The various parameters of the core include the number of functional units within
the cores and also their types. Some of the parameters of the core and node are
denoted as
Mp = Number of PCOS inside a node
Ma = Number of ALFUs under 1 SCOS
Ms = Number of SCOS under 1 PCOS
Total number of ALFUs in a core = Mp * Ma * Ms
The size of the population buffer is given by,
Population Buffer Size = No. of ALFUs in a population * (MAX(Latency of memory, Latency of ONNET) + Best case latency of ALFU)
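These two relations can be sketched directly. The function and argument names are ours, and the equation is read with the MAX term and the best-case ALFU latency summed inside the parentheses before being multiplied by the ALFU count:

```python
def total_alfus(mp, ms, ma):
    """Total ALFUs in a core: Mp PCOS x Ms SCOS per PCOS x Ma ALFUs per SCOS."""
    return mp * ms * ma

def population_buffer_size(alfus_per_population, mem_latency,
                           onnet_latency, best_case_alfu_latency):
    """Population buffer size per the equation above (in buffer entries)."""
    return alfus_per_population * (max(mem_latency, onnet_latency)
                                   + best_case_alfu_latency)
```

With the assumed values Mp = 2, Ms = 4, Ma = 8, a core holds 64 ALFUs.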
3.1.2 Analytical model of Compiler on Silicon (COS)
The parameters of the Compiler-On-Silicon include the sizes of the various tables in the COS [27], which determine the number of entries in those tables and thereby affect the scheduling rate, as well as the bandwidth of the SCOS, the PCOS etc. The equations which determine the bandwidth of the SCOS and PCOS, the sizes of the various tables, the sizes of the libraries etc. are as follows:
3.1.2.1 Analytical model of Secondary Compiler On Silicon (SCOS)
Suppose the required performance is T Teraops and the input contains all types of instructions from the Instruction Set Architecture. Assume that the architecture contains N types of ALFUs, each executing one type of instruction from the ISA; that there are a instructions to be executed on ALFUs of type 1, b instructions to be executed on ALFUs of type 2 and so on; and that there are ALFU1 ALFUs of type 1, ALFU2 ALFUs of type 2 and so on.
The various symbols used here are
C = Maximum number of Control Words that can be formed from 1 Sub-library
K = Maximum number of Sub-libraries input to 1 SCOS in 1 second
1 Library = Smax Sub-libraries
(i.e. 1 Library can consist of a maximum of Smax Sub-libraries)
Ni = Number of Instruction in a Control Word
No. of ops/ALFU, t = T / Total number of ALFUs = T/(Mp*Ms*Ma) ... (1)
No. of control words/ALFU = t/Ni control words per second per ALFU ... (2)
Peak BW of 1 SCOS = Input to Ma ALFUs/sec
= t * Ma / Ni control words per second [from (2)]
Peak BW of all SCOS = Peak BW of 1 SCOS * Ms
Expected BW of all SCOS = Sustained resource utilization of all ALFUs * Peak BW of SCOS ... (3)
Rate of Input to SCOS (in control words) = K*C ... (4)
Under ideal conditions, the rate of input to the SCOS, in terms of control words, will be equal to its output rate. So, from (2) and (4),
K*C = t*Ma / Ni
From above, K = t*Ma / (Ni * C) Sub Libraries ...(5)
The above equation gives the maximum number of sub-libraries that must be sched-
uled to the Secondary Compiler On Silicon (SCOS) under ideal conditions.
Processing capacity of SCOS = SCOS output rate / No. of instructions per sub-library
= (K*C*Ni) / (C*Ni)
= K
The processing capacity of the SCOS is in terms of sub-libraries, while the SCOS output rate is in terms of instructions.
The sub-library detail table holds one entry per scheduled sub-library, so its size is given by,
Sub-library Detail Table Size = K * SizeSDT
SizeSDT = [log2(No. of applications) + log2(No. of libraries/app) + log2(No. of sub-libraries/lib)]/8 + 2*log2(logical memory size) + 1
The instruction detail table should hold all the instructions of at least 1 sub-library.
So the minimum size of the instruction detail table is given by,
Instruction Detail Table Size = Maximum size of a sub-library * SizeIDT
= C * Ni * SizeIDT
SizeIDT in bytes = [ log2(No. of applications) + log2(No. of libraries/app) +
log2(No. of sub-libraries/lib) + log2(C*Ni ) + log2(No. of instructions in ISA)]/8+
4*log2(logical memory size) +2
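Equations (1), (5) and the two table sizes above can be collected into a small sizing sketch. All argument names and the sample values in the test below are illustrative assumptions, not design-point data from the thesis:

```python
import math

def scos_model(T, Mp, Ms, Ma, Ni, C,
               n_apps, libs_per_app, sublibs_per_lib,
               logical_mem_size, isa_size):
    """Sketch of the SCOS sizing equations with assumed inputs.

    T is the target performance in ops/sec; the returned table sizes are
    in bytes per the SizeSDT / SizeIDT expressions above."""
    t = T / (Mp * Ms * Ma)                       # ops per ALFU, eq. (1)
    K = t * Ma / (Ni * C)                        # sub-libraries per SCOS, eq. (5)
    lg = math.log2
    size_sdt = (lg(n_apps) + lg(libs_per_app) + lg(sublibs_per_lib)) / 8 \
               + 2 * lg(logical_mem_size) + 1
    size_idt = (lg(n_apps) + lg(libs_per_app) + lg(sublibs_per_lib)
                + lg(C * Ni) + lg(isa_size)) / 8 \
               + 4 * lg(logical_mem_size) + 2
    return {"K": K,
            "sdt_bytes": K * size_sdt,
            "idt_bytes": C * Ni * size_idt}
```

For instance, with the assumed values T = 64000, Mp = 2, Ms = 4, Ma = 8, Ni = 4 and C = 25, each ALFU handles t = 1000 ops/sec and K works out to 80 sub-libraries.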
3.1.2.2 Analytical model of Primary Compiler On Silicon (PCOS)
The equation describing the peak bandwidth of PCOS is given by
Peak BW of PCOS = Number of SCOS under a PCOS * Input to 1 SCOS
= Ms * K Sub libraries
= Ms * (t *Ma/(Ni*C)) [refer 5] ... (6)
Rate of Input to PCOS in terms of Libraries = Ms * K / Smax ... (7)
The above equation gives the maximum number of libraries which should be scheduled
to the PCOS under ideal conditions.
3.1.2.3 Analytical model of the scheduler
The functional Unit status table and the functional unit utilization table is used
by the schedulers to decide the ALFU to which an instruction must be scheduled to.
The functional Unit status table contains the status information of all the ALFUs
within the core. So the number of entries is equal to the number of ALFUs in that
core. The same applies for Functional Unit Utilization table as well.
Functional Unit Status table Size = No. of ALFUs /core * SizeFUS
= Mp*Ms*Ma * SizeFUS
Functional Unit Utilization table Size = No. of ALFUs / core * SizeFUU
= Mp*Ms*Ma * SizeFUU
SizeFUS in bytes = 4 + log2(Mp*Ms * Ma)
SizeFUU in bytes = 1 + log2(Mp*Ms*Ma)
The minimum size of the instruction buffer table is determined by the following equation:
Instruction Buffer Table Size = Scheduler latency * No. of ALFUs/core * SizeIBT
= Scheduler latency * Mp*Ms*Ma * SizeIBT
SizeIBT = [ log2(No. of applications) + log2(No. of libraries/app) + log2(No. of sub-
libraries/lib) + log2(C*Ni ) + log2(No. of instructions in ISA) + log2(Mp*Ms*Ma)]/8+
4*log2(logical memory size) +2
The size of the SCOS output buffer is given by,
SCOS o/p buffer size = Population buffer size * Number of ALFUs in a population
= Population buffer size * Ma
Primary Compiler On Silicon Scheduler
The size of the various tables required for the PCOS to schedule the sub-
libraries to the SCOS is given by
Library Address Table Size = No. of libraries scheduled to the PCOS * SizeLAT
Library status table size = No. of libraries scheduled to the PCOS * SizeLST
SizeLAT = [ log2(No. of applications) + log2(No. of libraries/app) + log2(No. of
sub-libraries/lib)]/8 + 2*log2(logical memory size)
SizeLST = [ log2(No. of applications) + log2(No. of libraries/app) + log2(No. of
sub-libraries/lib)]/8 +2
3.1.3 Analytical model of On Node Network (ONNET)
The parameters of the On-Node-Network include the number of buffer stages in the ONNET [26], the number of MIN stages, the number of ports etc. These parameters decide the bandwidth of the ONNET. When the bandwidth is low, the resources are under-utilized, as the functional units wait for the ONNET to deliver data instead of executing the instructions scheduled to them.
The various equations concerned with the On-Node-Network are as follows:
Peak Processing Capacity of SLR per Clock = Sum over i = 1 to No. of ALFU types in population of Ti * (Ki + Li) ... (3.1)
Where,
Ti = No. of ALFUs of type i in the population
Ki = Number of input bits to an ALFU of type i
Li = Number of output bits from an ALFU of type i + packetizing bits
Li = Ki + 2*log2(64)
where 2*log2(64) is the number of bits used for packetizing.
Peak Bandwidth = Processing Capacity / PdSLR
Where, PdSLR = Pipeline delay of the Sub Local Router
Resource utilization, R = No. of clock cycles in operation / Total number of clock cycles
Effective resource utilization, Re = [Sum over i = 1 to Total no. of clocks of (No. of instructions in pipe at Ci / Pipeline stages) * 100] / Total no. of cycles of execution
Expected bandwidth = Re * Peak BW of population
Peak BW = Q / Ps = [Sum over i = 1 to No. of ALFUs in population of (Ki + Li)] / Minimum(P1, P2, ...)
Here P1, P2, P3, P4 are the pipeline delays of the ALFUs in a population, the minimum of which is chosen as the pipeline delay of the sub local router.
Q = Total output from the Sub Local Router
For the maximum value, Q = Sum over i = 1 to No. of ALFUs of (No. of ports per ALFU * bytes per clock)
SLR Latency = MIN Latency + Packetization Latency + (No. of buffers * Buffer Delay)
MIN Latency = Switching Latency * No. of Stages
Packetization Latency = (Data Size / Packet Size) * time taken for a single packet
Packet Drop Probability = (1 − No. of buffers / No. of packets per clock) * 100
Maximum ALFU Data Size = Clock Ratio of MIN with ONNET * Buffer Bytes * Number of Ports
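The latency and drop-probability relations can be sketched as follows. Units are whatever the model's inputs use (cycles in the example below), and the clamp at zero in the drop probability is our addition, since the formula as written goes negative when the buffers outnumber the packets per clock:

```python
def slr_latency(switching_latency, min_stages, data_size, packet_size,
                time_per_packet, n_buffers, buffer_delay):
    """SLR latency = MIN latency + packetization latency + buffering delay."""
    min_latency = switching_latency * min_stages
    packetization = (data_size / packet_size) * time_per_packet
    return min_latency + packetization + n_buffers * buffer_delay

def packet_drop_probability(n_buffers, packets_per_clock):
    """Packet drop percentage; clamped at zero (our addition) for the case
    where buffers exceed the packets arriving per clock."""
    return max(0.0, (1 - n_buffers / packets_per_clock) * 100)
```

For example, with a 2-cycle switch over 3 MIN stages, a 64-byte transfer in 16-byte packets at 1 cycle each, and 4 buffers of 2 cycles, the SLR latency is 6 + 4 + 8 = 18 cycles.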
3.1.4 Analytical model for Bandwidth of memory for heterogeneous multi-cores
The bandwidth of memory is given by the following equation:
Bandwidth of Memory = Sum over i = 1 to Portmax of (W * R * Pr * Ph / Min pipeline delay)
Where,
Portmax = Maximum number of ports
R = Resource Utilization
Pr = Probability of read
Ph = Probability of hit
W = Size of a word in bytes
All these equations together would determine the performance of the CUBEMACH
design paradigm based architecture under ideal conditions.
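Since the summand in the memory bandwidth equation is identical for every port, the sum collapses to a product. A sketch, with argument names of our choosing:

```python
def memory_bandwidth(n_ports, word_bytes, resource_util,
                     p_read, p_hit, min_pipeline_delay):
    """Memory bandwidth model above; the per-port term is the same for
    every port, so the sum over Portmax ports becomes a product."""
    per_port = word_bytes * resource_util * p_read * p_hit / min_pipeline_delay
    return n_ports * per_port
```

With the assumed values of 4 ports, 8-byte words, 50% utilization, read and hit probabilities of 0.5 each, and a minimum pipeline delay of 2 cycles, the model gives 2 bytes per cycle.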
3.2 Power modelling of Algorithm Level Functional Units
The CUBEMACH architecture consists of ALFUs whose basic computational units are the MIP Cells [29][30]. MIP cells are logically equivalent to CMOS complex logic gates and their variants. Hence, it is sufficient to obtain models of the dynamic, static and short-circuit power of an individual MIP Cell and then extrapolate these models to estimate an ALFU's total power. The dynamic power dissipation is obtained by estimating the number of power-consuming transitions for each MIP Cell under consideration. The load capacitance of a MIP Cell is determined by the number of fan-out MIP Cells.
The analytical equation used to calculate the dynamic power dissipation is:
Pdyn = Cload * VDD^2 * Activity factor * Frequency of operation
where,
VDD = Supply voltage
Activity factor = number of power dissipating transitions / total number of transitions
Frequency of operation = ALFU clock rate
Cload = load capacitance, determined by the number of fan-out MIP Cells
To calculate the short-circuit power dissipation, each MIP Cell was modelled as a single equivalent inverter, and the input signals to the MIP cell were modelled as a single equivalent input signal [32]. The input rise time and the output fall time, which affect the transistor peak current, were assumed to be equal so that the power dissipation due to the direct path current is minimized [31].
The analytical expression used to estimate the direct path power is:
Pdp = (Cshort-circuit + Cload) * VDD^2 * Frequency of operation
Where,
Cshort-circuit = tsc * Ipeak / VDD
As mentioned before, modelling the static power dissipation is important in the design phase of the architecture. Static current in a CMOS device is due to the reverse-biased junctions of the transistors in the off state and also due to the sub-threshold current, as discussed before. The analytical expression used to calculate the static power is:
Pstatic = Istatic * VDD
Where,
Istatic = Ileakage + Isub-threshold + Iband-to-band-tunnelling
is the static current of the transistor.
The values of the leakage current, static current, threshold voltage and supply voltage for the 45 nm technology node were obtained from the ITRS. The total power dissipation is the sum of the dynamic, static and direct path dissipations:
Ptotal = Pdyn + Pstatic + Pdirect-path
With these equations, the total power dissipation of a single MIP Cell is obtained. From these values, the total power dissipation of the ALFUs is calculated using equations, one per ALFU type, which give the number of MIP cells of each type as a function of the problem size and the size of the data.
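A sketch of the total-power calculation for one MIP cell, following the three equations above. The short-circuit capacitance is taken as tsc*Ipeak/VDD, so that Pdp reduces to the familiar tsc*Ipeak*VDD*f form plus the load term; all numeric inputs in the test are placeholders rather than ITRS 45 nm data:

```python
def mip_cell_power(c_load, vdd, activity, freq, t_sc, i_peak, i_static):
    """Total power of one MIP cell: dynamic + direct-path + static terms."""
    p_dyn = c_load * vdd ** 2 * activity * freq
    c_sc = t_sc * i_peak / vdd           # equivalent short-circuit capacitance
    p_dp = (c_sc + c_load) * vdd ** 2 * freq
    p_static = i_static * vdd
    return p_dyn + p_dp + p_static
```

An ALFU's power then follows by multiplying per-cell figures by the MIP-cell counts given by the per-ALFU-type equations mentioned above.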
3.3 Optimization of heterogeneous multi-core architectures using single temperature simulated annealing
3.3.1 Experimental methodology
We implemented a clock-accurate full system simulator which simulates the sched-
uler, the functional units, the backbone network architecture and the memory. The
scheduler is implemented in the simulator as a hardware scheduler (such as in [16])
for scheduling the instructions to the various functional units. The functional units in the architecture include basic scalar units such as adders, multipliers and dividers; matrix units such as matrix adders, matrix multipliers, matrix invertors and Crout's units, all of which solve problems of size 2x2 and are built from the basic scalar units; and graph-theoretic units, including graph traversal units and sorters, which are also built from basic scalar units. A hierarchical network employing the omega Multistage Interconnection Network (MIN) structure was included in the simulator. The hierarchical network is used for data transfer within the core as well as across cores. A subset of the functional units within the core has a private L1 cache, and each core
has its own L2 cache, and the L3 cache is common to all the cores. The simulator
is clock driven. After all the units in the architecture receive the clock signal, they
finish their execution for that clock cycle. After all the units finish their execution
for that particular clock cycle, the next clock trigger signal is sent to all the units for
execution in the next clock cycle.
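The clocking discipline described above (every unit completes the current cycle before the next trigger is broadcast) reduces to a simple loop. The Unit class here is a minimal stand-in of our own, not the simulator's actual interface:

```python
class Unit:
    """Minimal stand-in for a simulated unit (ALFU, scheduler, router...)."""
    def __init__(self, name):
        self.name = name
        self.cycles_done = 0

    def tick(self):
        self.cycles_done += 1           # one clock cycle's worth of work

def run(units, n_cycles):
    """Clock-driven loop: every unit finishes the current cycle before the
    next clock trigger is broadcast, as in the simulator described above."""
    for _ in range(n_cycles):
        for u in units:                 # broadcast the clock trigger
            u.tick()                    # unit completes this clock cycle
```

Because no unit sees the next trigger until all units have finished the current cycle, the loop gives every unit a consistent view of the architecture state at each cycle boundary.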
The values of the various architecture parameters such as the number of parallel
schedulers, the kinds of the functional units in the architecture and the number of
functional units of each kind, the buffer sizes for the hierarchical network, the number
of ports in the MIN structure, the size of L1 cache, L2 cache and the L3 cache etc.
for each of the cores are written in configuration file. These values are read by the
clock-accurate simulator to start the optimization process. The architecture which
corresponds to the parameter values in the configuration file is simulated and the
results of the simulation such as the performance of the architecture, the resource
utilization of the functional units, data transferred across the cores etc. are obtained.
The optimizer updates the values of the various parameters in the configuration file according to the proposed technique. When the input specifications are met, the optimization process is stopped and the final parameter values are taken as the near-optimal solution.
For testing the optimizer, we generated our own complex workloads with high computation and communication complexity. A variety of algorithms were used in generating these workloads. Results are shown for two workloads: one is computationally intensive, making predominant use of matrix-based algorithms such as matrix multiplication, LU-decomposition and matrix inversion, whereas the other consists of algorithms such as sorting, graph traversal and convex hull. These algorithms are widely used in scientific and commercial applications and are representative of such workloads.
3.3.2 Results for single temperature simulated annealing process
The results are shown for two workloads. The first workload is composed mainly
of matrix based algorithms such as matrix multiplication, matrix inversion, LU-
Decomposition etc.
The second workload is composed mainly of graph based algorithms such as graph
traversal, sorting and convex-hull. The results shown are for 4-core architecture with
each core consisting of 32 functional units with the functional units operating at a
frequency of 800MHz.
[Figure: performance in GFLOPS (0 to 400) plotted against optimization iterations 1 to 37]
Figure 3.1: Performance of the architecture over the iterations of the optimization process - Matrix Based Algorithms
Figure 3.1 shows the performance variation over the iterations of the optimization
process for workload 1, which largely consists of matrix based algorithms. The input
specifications demanded a performance of 500 GOps from the architecture. A maxi-
mum performance of 375 GOps was achieved from the optimization process as shown
in Fig. 3.1 before the process terminated. The initial architecture had a resource
utilization of about 29% which increased to over 42% as a result of the optimization
process. One reason for the low resource utilization from the optimization process is
the high dependency in the instructions of the workload generated. Another reason
is that a single temperature based simulated annealing is not sufficient for optimizing
[Figure: average scheduling rate per thousand clock cycles (0 to 70) for the initial and optimized architectures, plotted over 1 million clock cycles]
Figure 3.2: Rate of scheduling for every 1000 clock cycles for initial architecture and final architecture - Matrix based Algorithms
such a heterogeneous architecture with very large number of parameters. For this rea-
son we propose the usage of multiple temperature simulated annealing to handle the
large number of parameters, which is to be taken up as future work. The scheduling rate for the initial architecture and the optimized architecture for workload 1 is shown in Fig. 3.2 for 1 million clock cycles. It can be seen from the graph (Fig. 3.2) that the scheduling rate has increased for the optimized architecture. This is because the number of matrix-based units has increased in the final optimized architecture, which can be seen from Figure 3.5.
Figure 3.3 shows the performance variation over the iterations of the optimization process for workload 2, which largely consists of graph-based algorithms. A performance of 475 GOps was achieved as a result of the optimization process, as shown in Figure 3.3, where the input specification was 750 GOps. For this workload, a resource utilization of 30% was achieved in the optimized architecture, up from 18% in the initial architecture. The low resource utilization is a direct consequence of the high level of dependency among the instructions in the workload and the fact that a single temperature based simulated annealing process is not sufficient for the large number of parameters in heterogeneous architectures. The scheduling rate has
[Figure: performance in GFLOPS (0 to 500) plotted against optimization iterations 1 to 59]
Figure 3.3: Performance of the architecture over the iterations of the optimization process - Graph Based Algorithms
[Figure: average scheduling rate per thousand clock cycles (0 to 50) for the initial and optimized architectures, plotted over 1 million clock cycles]
Figure 3.4: Rate of scheduling for every 1000 clock cycles for initial architecture and final architecture - Graph based Algorithms
increased to an average of 38 instructions per clock cycle, from an initial value of 22 instructions per clock cycle for the starting architecture (Fig. 3.4). This is mainly because of the increased availability of suitable functional units, which can be seen from Figure 3.5. Most instructions are dependent on the remaining, smaller set of instructions in the workload. This explains the scheduling rate of the optimized architecture, in which the number of instructions scheduled increases and decreases alternately: the dependent instructions are executed in parallel in large numbers after their parent
instructions finish execution in the case of workload 2.
[Figure: number of scalar, matrix and graph-theoretic units (0 to 140) in the initial and optimized architectures, for the matrix-based and graph-based workloads]
Figure 3.5: Number of units in the initial and optimized architecture - Matrix and Graph based algorithms
Figure 3.5 shows the number of functional units of each kind in the 4-core architecture. The functional units are classified as scalar units, matrix units and graph units. For workload 1, consisting of algorithms such as matrix multiplication, matrix inversion and LU-decomposition, the number of matrix units increased after the optimization process. For workload 2, consisting of algorithms such as convex hull, sorting and graph traversal, the number of graph units increased after the optimization process. This contributed to the increase in performance (Fig. 3.1 and Fig. 3.3).
Figure 3.6 shows the impact of core formation for a single iteration of the optimization process. The amount of data transferred across cores reduced after highly communicating functional units were grouped together during core formation. This reduced the communication latency, which contributed to the increase in performance. The impact of core formation is higher for workload 2, where there is a higher dependency among instructions and, as a result, more functional units have to communicate with each other. After the core formation process, these communications were localized within the core.
[Figure: data transferred across cores in MB (0 to 4000), before and after core formation, for the matrix-based and graph-based workloads]
Figure 3.6: Impact of core formation - Amount of data transferred before and after core formation by application of graph partitioning
CHAPTER IV
WORK TO BE COMPLETED
From the results obtained for single temperature simulated annealing, it can be seen that although there is a significant improvement in the performance of the heterogeneous architecture, the process terminates before the input specifications are met. Using single temperature simulated annealing, it is difficult to explore the large design space arising from the very high number of architecture parameters in heterogeneous multi-core architectures. We therefore adopt multiple temperature simulated annealing, with different temperatures for different components of the design space. Also, so far only performance has been considered in the optimization process. After the development of a power model for the entire architecture, power aspects are to be included in the optimization as well.
4.1 Power model of CUBEMACH based architectures
The power models of the various ALFUs have already been developed. Since all the units in the architecture are based on MIP cells, the number of MIP cells in the various functional blocks of the architecture can be expressed as equations, once the architecture of each functional block has been represented using MIP cells. Once a generalized equation relating the number of MIP cells of each kind to the various architecture parameters is established, the power consumed by an entire architecture based on the CUBEMACH design paradigm can be found from the number of MIP cells used in building it. The work to be done here is to represent every functional block of the architecture in terms of MIP cells and then establish equations which give the number of MIP cells used in the architecture, given the values of the various architecture parameters. This has already been done for the ALFUs; similar equations have to be formed for the other units.
4.2 Optimization of heterogeneous multi-core architectures using multiple temperature simulated annealing
The results shown in this proposal made use of single temperature simulated annealing as the global optimization process. The work to be done includes implementing multiple temperature simulated annealing, the advantages of which are described in Section 2.3.4. Closely related parameters have to be grouped together and a starting temperature assigned to each group. The temperature schedule and its governing equation have to be determined for each of the groups involved in the process. The temperature of each group will depend on the distribution of parameters in that group.
[Figure: parameters shown include the number of MatMul, MatAdd, MatInv and LUD units; the number of scheduler ports, the scheduler table size and the parallel scheduler units; the cache size, block size and memory ports; and the ONNET ports, buffer size, switching rate and MIN stages]
Figure 4.1: Multiple Temperature Simulated Annealing
Figure 4.1 shows a limited number of parameters of a heterogeneous multi-core architecture based on the CUBEMACH design paradigm. The temperature corresponding to each parameter is represented by a colour. As shown in the
diagram, different groups of closely related parameters have a common temperature associated with them. During the optimization process, each parameter is explored according to its temperature. This process is yet to be implemented and is expected to improve the performance obtainable from the resulting architecture.
REFERENCES
[1] S. Kirkpatrick et al., Optimization by Simulated Annealing, In Science 220, 1983.
[2] BW Kernighan, S Lin, An efficient heuristic procedure for partitioning graphs, In
Bell System Technical Journal, 1970.
[3] J. von Neumann et al., Theory of Games and Economic Behavior, 3rd ed. In New
York: Wiley, 1964.
[4] R. Kumar et al., Core architecture optimization for heterogeneous chip multipro-
cessors, In Proceedings of the 15th international conference on Parallel architec-
tures and compilation techniques, 2006.
[5] S. Kang et al., Magellan: a search and machine learning-based framework for
fast multi-core design space exploration and optimization, In Proceedings of the
conference on Design, automation and test in Europe, 2008.
[6] P. J. M. van Laarhoven and E. H. L. Aarts, Simulated Annealing: Theory and
Applications. Mathematics and Its Applications. In Springer, Kluwer Academic
Publishers, Norwell, 1987.
[7] Andreas Nolte and Rainer Schrader. A note on the finite time behaviour of sim-
ulated annealing. In Operations Research Proceedings, 1996.
[8] Ross P.E. , Why CPU Frequency Stalled, In IEEE Spectrum, March, 2008.
[9] Geer D., Chip makers turn to multicore processors, In Computer, Vol. 38, Issue 5, May 2005.
[10] In http://www.intel.com/products/processor/pentium_dual-core/index.htm
[11] Richard McDougall and James Laudon, Multi-core microprocessors are here, In
USENIX Magazine, Vol 31, October 2006
[12] Shah M. et al., UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC, In IEEE Asian Solid-State Circuits Conference, 2007.
[13] Le, H. Q. et al., IBM POWER6 microarchitecture, In IBM Journal of Research and Development, Vol. 51, Issue 6, Nov 2007.
[14] In http://www.intel.com/consumer/products/processors/corei7.htm
[15] Dorsey J. et al., An Integrated Quad-Core Opteron Processor, In IEEE Interna-
tional Solid-State Circuits Conference, 2007. ISSCC 2007.
[16] Kahle J. A. et al., Introduction to the Cell multiprocessor, In IBM Journal of
Research and Development , 2005
[17] T. Chen et al., Cell Broadband Engine Architecture and its first implementation-
A performance view, In IBM Journal of Research, 2007.
[18] M. Gschwind et al., Synergistic Processing in Cell’s Multicore Architecture, In
MICRO, 2006.
[19] In http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[20] Gupta, R.K. et al., Introducing core-based system design, In IEEE Design and
Test of Computers, 1997
[21] Ravindhiran Mukundrajan, Towards Modeling and Design Automation of Supercomputing Clusters: SCOC IP Cores and its application to biological neuronal network modeling, http://www.warftindia.org/Thesis/Vishwakarma/Ravi_Thesis.pdf, Thesis submitted to WARFT.
[22] R. Kumar et al., Single-ISA Heterogeneous Multi-Core Architectures for Mul-
tithreaded Workload Performance, In ACM SIGARCH Computer Architecture
News, 2004.
[23] Venkateswaran N et al., On the concept of simultaneous execution of multiple
applications on hierarchically based cluster and the silicon operating system, In
IEEE International Symposium on Parallel and Distributed Processing, 2008
[24] Venkateswaran N et al., Custom Built Heterogeneous Multi-core Architectures
(CUBEMACH): Breaking the conventions, In IPDPSW, 2010
[25] Shyamsundar G., Homogeneously structured heterogeneous functional cores for SuperComputer on a Chip: MIP Paradigm based Design Space, http://www.warftindia.org/Thesis/Vishwakarma/ShyamSep2007.pdf, Thesis submitted to WARFT.
[26] Balaji Subramaniam, Towards Modeling And Integrated Design Automation Of Supercomputing Clusters (MIDAS): On Chip Networks for SCOC, Thesis submitted to WAran Research FoundaTion (WARFT).
[27] Aravind Vasudevan, Towards Modeling And Integrated Design Automation Of Supercomputing Clusters (MIDAS): Hierarchical Compiler On Silicon, Thesis submitted to WAran Research FoundaTion (WARFT).
[28] Piotr Czyzak, Andrzej Jaszkiewicz, Pareto simulated annealing: a metaheuristic technique for multiple-objective combinatorial optimization, In Journal of Multi-Criteria Decision Analysis, Jan 1998.
[29] Venkateswaran N et al., Memory in processor: A novel design paradigm for
supercomputing architectures, in ACM SIGARCH Computer Architecture News
archive Volume 32, Issue 3, 2004.
[30] Venkateswaran N et al., Memory In Processor- Supercomputer On a Chip:
Processor Design and Execution Semantics for Massive Single-Chip Perfor-
mance, in 19th IEEE International Parallel and Distributed Processing Symposium
(IPDPS05)
[31] H.J.M Veendrick, Short-circuit dissipation of static CMOS circuitry and its im-
pact on design of buffer circuits. In IEEE journal of solid state circuits, August
1984.
[32] Wang, Q. et al., A new short circuit power model for complex CMOS gates, In
Proceedings IEEE Alessandro Volta Memorial Workshop, 1999