Performance and Power optimization of heterogeneous multi-cores
A Thesis Proposal
Submitted to
WAran Research FoundaTion (WARFT)
by
Vignesh Adhinarayanan
Senior Research Trainee
In Partial Fulfillment of the Requirements for
Research Awareness Program and Training (RAPT)
at
Waran Research Foundation
Vishwakarma - High Performance Computing Group
WAran Research FoundaTion (WARFT)
Chennai, India
URL: http://www.warftindia.org
email: [email protected]
Phone: +91-44-24899766
December 2010
Performance and Power optimization of heterogeneous multi-cores
Approved by:
Prof. N. Venkateswaran, Advisor
Founder-Director, WARFT
Date Approved:
To
Prof. Waran, my Guru
ABSTRACT
Conventional processor design methodologies, which exhaustively evaluate every
possible configuration of an architecture, are not suitable for heterogeneous
architectures because of the large number of parameters involved. When these
architectures are designed to suit a wide class of applications, their optimization
becomes very important, unlike the case of general-purpose processors.

An optimization methodology is proposed here that effectively searches the design
space in order to find architectures satisfying the power, performance, or
performance-to-power requirements of these multiple applications. The methodology
employs Game Theory, in view of the large number of architecture parameters, to
select a subset of parameters to be perturbed at different stages of the
optimization process; Simulated Annealing to find architectures that match the
input specifications; and graph partitioning to form cores in which highly
communicating functional units are grouped together in order to minimize
communication latency. Simulated annealing with multiple temperatures, which is
expected to give better results considering the large design space of
heterogeneous architectures, is yet to be implemented to optimize the architecture.
TABLE OF CONTENTS

DEDICATION
ABSTRACT
LIST OF FIGURES

I ORIGIN AND HISTORY
  1.1 Introduction
  1.2 Homogeneous multi-core processors
  1.3 Heterogeneous multi-core processors
  1.4 Custom built heterogeneous multi-core architectures (CUBEMACH)
  1.5 The need for optimization of custom-built heterogeneous multi-cores
  1.6 Overview of Thesis

II PROPOSED RESEARCH
  2.1 Performance modelling of heterogeneous multi-core architectures
  2.2 Power modelling of heterogeneous multi-core architectures
  2.3 Custom Built Heterogeneous Multi-Core Architecture (CUBEMACH) Optimization
    2.3.1 Application of Game Theory to choose the parameters to be perturbed
    2.3.2 Application of simulated annealing for choosing the optimal solution
    2.3.3 Application of graph partitioning algorithm for heterogeneous core formation
    2.3.4 Application of multiple temperature simulated annealing for the optimization process

III WORK COMPLETED
  3.1 Performance modelling of heterogeneous multi-cores
    3.1.1 Analytical model of a heterogeneous core
    3.1.2 Analytical model of Compiler on Silicon (COS)
    3.1.3 Analytical model of On Node Network (ONNET)
    3.1.4 Analytical model for bandwidth of memory for heterogeneous multi-cores
  3.2 Power modelling of Algorithm Level Functional Units
  3.3 Optimization of heterogeneous multi-core architectures using single temperature simulated annealing
    3.3.1 Experimental methodology
    3.3.2 Results for single temperature simulated annealing process

IV WORK TO BE COMPLETED
  4.1 Power model of CUBEMACH based architectures
  4.2 Optimization of heterogeneous multi-core architectures using multiple temperature simulated annealing
LIST OF FIGURES

2.1 Optimization Flow
2.2 General Simulated Annealing
2.3 Multiple Objective Simulated Annealing [28]
2.4 Modelling Communication
3.1 Performance of the architecture over the iterations of the optimization process - Matrix Based Algorithms
3.2 Rate of scheduling for every 1000 clock cycles for initial architecture and final architecture - Matrix based Algorithms
3.3 Performance of the architecture over the iterations of the optimization process - Graph Based Algorithms
3.4 Rate of scheduling for every 1000 clock cycles for initial architecture and final architecture - Graph based Algorithms
3.5 Number of units in the initial and optimized architecture - Matrix and Graph based algorithms
3.6 Impact of core formation - Amount of data transferred before and after core formation by application of graph partitioning
4.1 Multiple Temperature Simulated Annealing
CHAPTER I
ORIGIN AND HISTORY
1.1 Introduction
The computational demand of applications, such as brain modelling, weather mod-
elling, protein folding, astrophysics simulation etc., has been continuously increasing
over time. This demand led to the development of processors capable of delivering
higher performance. Initially, in order to improve the performance of processors, the
clock frequency of the processors was increased to a point supported by improve-
ment in technology. However, this resulted in increased power dissipation and design
complexity, as well as reduced reliability [8]. As a result, the trend of increasing the
clock frequency saturated once clock frequencies of 4 GHz were reached (Intel
cancelled a 4 GHz processor) [9]. In order to meet the increased demands of
applications, multi-core processors were introduced [9], in which higher performance
can be extracted without increasing the clock frequency or the design complexity.
1.2 Homogeneous multi-core processors
The conventional approach to multi-core design involves replicating the same core
many times in a single processor. These cores have the same kinds of units and the
same number of units of each kind. They operate at the same clock frequency, are
capable of delivering the same performance, and are optimized to consume minimal
power during execution. In the case of such homogeneous multi-cores, an increase
in performance is achieved by extracting parallelism from applications and
executing the independent parts of the applications on the multiple cores
concurrently. The performance achieved by these processors increases when multiple
applications are run on them as more parallelism can be extracted from multiple ap-
plications.
Homogeneous multi-cores have been widely used. One example is the Pentium
Dual Core processor [10], in which two identical cores make up the processor;
Intel's approach to multi-core processors has been to replicate the same core
many times in order to improve performance, with each core optimized for high
performance and low power dissipation. Another example is Sun Microsystems's
UltraSPARC processor [12], which is available with four, six, or eight identical
cores. The UltraSPARC processor can run many threads on a single core, and when a
long-latency event such as a cache miss occurs, another thread is brought in for
execution [11]. Other examples of homogeneous multi-core processors include
Intel's Core i7 [14], AMD's Quad-Core Opteron [15], IBM's POWER [13], etc.
One of the main advantages of resorting to homogeneous multi-core processors
is that the design turnaround time is very short. Since each core in a
processor is identical, only one of them needs to be optimized for the required
performance and power; it can then be replicated, which does not take much time.
Also, it is sufficient to test and validate just one core, which again helps
bring down the design turnaround time. Another important advantage is that the
cost involved in implementing the processors is reduced: since the cores are
identical in all aspects, the masks used in fabrication can be reused for all
the cores. Thus the use of homogeneous cores brings down the turnaround time and
reduces manufacturing costs.
1.3 Heterogeneous multi-core processors
As multiple applications are being run on multi-core processors in order to im-
prove their performance, the usage of homogeneous multi-cores would mean that each
of these cores should be capable of executing all the applications. These applications
can have diverse needs: one of them may be predominantly based on matrix oper-
ations; another can be based on graph theoretic operations and so on. When using
homogeneous cores, each core should have sufficient resources to execute all
these applications. This results in over-provisioning of resources and,
consequently, resource wastage, since all the cores will contain all kinds of
units while only some of these units are used, depending on the application
mapped onto the core.
Some examples of heterogeneous multi-cores include IBM's Cell Broadband
Engine [16] [17]. It contains a main processor, called the Power Processing Element
(PPE), eight co-processors, called Synergistic Processing Elements (SPEs), and
the Element Interconnect Bus (EIB) [17] [18]. The Instruction Set Architectures of
the PPE and the SPEs are different, and the two have different functionalities.
Another example of a heterogeneous multi-core is the GPU: the nVidia Fermi
architecture [19] contains four specialised cores with units for sine, cosine,
reciprocal, and square-root operations, while all the other cores are identical
in nature.
Heterogeneous cores have their disadvantages: fabrication cost increases
because each core is different in nature, and design turnaround time increases
because each core has to be designed separately, optimized for the necessary
performance and power requirements, and then tested and verified to ensure its
proper functionality. However, the implementation cost can be brought down by
resorting to a cell-based approach in which a limited set of cells is used.
Another way of reducing the cost involved in designing heterogeneous
architectures is to use SCOC IP Cores (SuperComputer On Chip IP Cores) [21]. The
advantages offered by heterogeneous multi-cores include better resource
utilization [22], as applications or parts of applications can be mapped onto
the cores which best fit them, making efficient use of the available resources.
Better resource utilization also means that the power consumed in executing the
applications can be minimized,
as fewer units are idle, thereby consuming less static power.
1.4 Custom built heterogeneous multi-core architectures (CUBEMACH)
When heterogeneous cores are used, the design of the architecture should
reflect the needs of the multiple applications run on it concurrently. The
CUBEMACH (Custom Built Heterogeneous Multi-core Architectures) design paradigm
offers the possibility of increased resource utilization by running multiple
applications simultaneously [23] without space-time sharing. Architectures
based on the CUBEMACH design paradigm [24] are designed to suit the multiple
applications that are to be run on them. These architectures employ a variety of
Algorithm Level Functional Units (ALFUs) [25], apart from scalar units, in order
to meet the performance requirements of the applications. Examples of ALFUs
include 2X2 matrix multiplication units, 2X2 Crout's units, graph traversal
units, sorter units, etc. These Algorithm Level Functional Units are driven by
algorithm-level instructions, which constitute the ISA for architectures based
on the CUBEMACH design paradigm; this ISA is called the Algorithm Level
Instruction Set Architecture (ALISA) [25]. Architectures designed on the
CUBEMACH paradigm use a cell-based approach implemented on SCOC IP Cores to
minimize cost [21]. The On-Node-Network (ONNET) [25], based on a Multistage
Interconnection Network (MIN), takes care of communication both within and
across the cores. The Compiler-On-Silicon (COS) [25] takes care of compilation
and scheduling; a hardware-based approach is used to meet the high performance
requirements.
1.5 The need for optimization of custom-built heterogeneous multi-cores
Conventional processor design methodologies involve exhaustive design space
exploration, in which all possible processors that could be built from the design
space are considered and simulated in order to study how they perform for
different sets of inputs [4]. From this set of all possible processors, the
single processor that comes closest to satisfying the requirements of the user is
selected on the basis of the simulation results and is implemented. This approach
to designing processors posed no problems until heterogeneous multi-cores came
into being [19] [16] to cater to the diverse computational needs and power
requirements of the Grand Challenge Applications. The usage of heterogeneous
multi-cores made the conventional processor design methodology obsolete, as the
number of architectural parameters to be considered becomes very high, making the
design space larger. This means that the number of possible processors from the
design space increases exponentially [5], and as a result the time taken to
explore the design space exhaustively also becomes very high. Hence arises the
need for effective design space exploration in order to prune the search space.
The need for optimization in general-purpose processors is comparatively
lower, as these processors are not specialized for any particular application or
class of applications. In application-specific processors, it becomes important
to tune the processor to suit the particular application it runs; optimizing the
processor for that application becomes critical. CUBEMACH-based processors, which
are not general-purpose processors, favour running a wider class of applications
on the same node without time or space sharing. These applications can have
different requirements (performance, power, performance-to-power ratio, etc.),
and the architecture design has to reflect the requirements of these individual
applications. If these applications belong to different users, then optimization should
be carried out to satisfy the requirements of the different users; otherwise, if
the applications belong to a single user, the optimization has to be carried out
to satisfy the requirements of the different applications. Hence arises the need
to optimize the architecture to suit the requirements of the multiple applications.
1.6 Overview of Thesis
The optimization methodology proposed here explores a pruned design space in
order to find an architecture that meets the performance, power, or
performance-to-power ratio requirements of the multiple applications to be run
on an architecture designed with the CUBEMACH design paradigm. Game Theory [3]
is employed by the optimization process in view of the vast number of intricately
related architecture parameters: in every iteration, it selects the subset of
architecture parameters whose values are changed. Simulated annealing [1] is
employed to explore the search space in order to achieve the multiple objectives,
which are the individual applications' requirements. The optimization technique
makes use of the Kernighan-Lin algorithm [2] for core formation, grouping ALFUs
into populations [24] and populations into cores in order to minimize
communication across populations and cores, which reduces the execution time of
an application.
CHAPTER II
PROPOSED RESEARCH
2.1 Performance modelling of heterogeneous multi-core architectures
An analytical model of a system is used to predict the performance of the system
for given applications. The system is represented in terms of its various
architecture parameters; the analytical model of the system is the set of
equations, whose variables are the architecture parameters, that describes the
performance of the system. This set of equations thus captures the relationship
among the various parameters of the system and can be used to predict the
performance of the system under ideal conditions. We form equations relating the
various architecture parameters to each other after analyzing the CUBEMACH-based
architecture.
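The idea can be illustrated with a toy analytical model. The following sketch is purely illustrative: the parameter names (num_alfus, sched_rate, onnet_latency, comm_fraction) and the form of the equation are assumptions made for exposition, not the actual CUBEMACH model.

```python
def predicted_cycles(num_instructions, num_alfus, sched_rate,
                     onnet_latency, comm_fraction):
    """Estimate execution cycles under ideal conditions (toy model)."""
    # Compute is limited either by the ALFU count or by the scheduler issue rate.
    issue_limit = num_instructions / min(num_alfus, sched_rate)
    # Communication adds network latency for the fraction of results moved.
    comm_cycles = num_instructions * comm_fraction * onnet_latency
    return issue_limit + comm_cycles

cycles = predicted_cycles(num_instructions=10_000, num_alfus=8,
                          sched_rate=4, onnet_latency=12, comm_fraction=0.2)
```

A model of this shape makes parameter interactions explicit: here, adding ALFUs beyond the scheduler's issue rate yields no predicted gain, a dependency the optimization process must respect.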
2.2 Power modelling of heterogeneous multi-core architectures
Architectures based on the CUBEMACH design paradigm are built from MIP cells,
which are the basic building blocks. Power modelling of CUBEMACH-based
architectures will involve modelling the power of the various units in the
architecture that are built from MIP cells, modelling the power consumed by
memory, and modelling the power consumed by interconnects and bus lines. Cells
are logically equivalent to CMOS complex logic gates and their variants. Hence,
it is sufficient to obtain models of the dynamic, static, and short-circuit power
for an individual MIP cell [4] and then extrapolate these models to estimate the
total power consumed by the architecture.
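As an illustration of the extrapolation step, the following sketch scales assumed per-cell dynamic, static, and short-circuit power figures by cell count and switching activity. All names and numbers are hypothetical, not measured MIP cell data.

```python
def unit_power(n_cells, activity, p_dynamic_cell, p_static_cell, p_short_cell):
    """Power of one unit built from n_cells MIP cells: dynamic and
    short-circuit power scale with switching activity, while static
    (leakage) power is paid by every cell regardless of activity."""
    return n_cells * (activity * (p_dynamic_cell + p_short_cell) + p_static_cell)

def architecture_power(units):
    """Extrapolate: total architecture power is the sum over all units."""
    return sum(unit_power(**u) for u in units)

total = architecture_power([
    {"n_cells": 5000, "activity": 0.3, "p_dynamic_cell": 2e-6,
     "p_static_cell": 1e-7, "p_short_cell": 2e-7},
    {"n_cells": 2000, "activity": 0.5, "p_dynamic_cell": 2e-6,
     "p_static_cell": 1e-7, "p_short_cell": 2e-7},
])
```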
2.3 Custom Built Heterogeneous Multi-Core Architecture (CUBEMACH) Optimization
The optimization engine for CUBEMACH is shown in Figure 2.1. The requirements
of the different applications, belonging to the same or different users, are
given as input to the optimization engine. Values of the various architectural
parameters are assigned based on the analytical model of the architecture and
the applications to be run. The Game Theory engine selects a subset of
parameters from the set of all architecture parameters; this subset is selected
based on heuristics, so that only those parameters expected to significantly
affect the result are chosen. Simulated annealing is employed as the global
optimization process, and the values of the selected parameters are varied. The
new architecture so derived is simulated, and from the results the communication
pattern is obtained and modelled as a graph. The KL graph partitioning algorithm
is employed to group the highly communicating ALFUs into a single population.
The resulting architecture is simulated again and the performance-to-power ratio
obtainable from it is calculated. This value is compared with the required
performance-to-power ratio; when the input requirements are met, the
optimization process stops and the solution is obtained.
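The overall flow can be sketched as the loop below. The stand-in functions are toy stubs for the Game Theory engine, the annealing move, the simulator, and KL core formation respectively; their names, signatures, and the toy metric are assumptions, not the actual implementations.

```python
import random
from dataclasses import dataclass

@dataclass
class Result:
    comm_graph: dict
    perf_to_power: float

def select_parameters(arch):
    # Stand-in for the Game Theory engine: pick one parameter at random.
    return random.sample(list(arch), k=1)

def perturb(arch, subset):
    # Stand-in for the simulated-annealing move: nudge each chosen parameter.
    new = dict(arch)
    for p in subset:
        new[p] = max(1, new[p] + random.choice((-1, 1)))
    return new

def simulate(arch):
    # Stand-in for the simulator: toy metric where extra ALFUs help only
    # until the scheduler's issue rate becomes the bottleneck.
    perf = min(arch["num_alfus"], arch["sched_rate"])
    power = 0.1 * sum(arch.values())
    return Result(comm_graph={}, perf_to_power=perf / power)

def form_cores(arch, comm_graph):
    # Stand-in for KL-based core formation (a no-op in this sketch).
    return arch

def optimize(arch, required_ppw, max_iters=1000):
    """Loop until the required performance-to-power ratio is met."""
    for _ in range(max_iters):
        candidate = perturb(arch, select_parameters(arch))
        candidate = form_cores(candidate, simulate(candidate).comm_graph)
        if simulate(candidate).perf_to_power >= required_ppw:
            return candidate
        arch = candidate
    return arch

random.seed(1)
final = optimize({"num_alfus": 2, "sched_rate": 2}, required_ppw=1.0)
```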
2.3.1 Application of Game Theory to choose the parameters to be perturbed
The implementation of Game Theory is tightly integrated with the implementation
of simulated annealing. Game Theory is employed to select the subset of
architecture parameters whose values are changed in every iteration of the
simulated annealing process to determine the neighbour state, as explained in
the next section.
[Figure 2.1: Optimization Flow - architecture parameters and the desired
performance feed a loop of Game Theory, Simulated Annealing, simulation, core
formation, and a second simulation, whose obtained performance is fed back until
a near-optimal solution is reached]
The objective of the game is to reach an architecture state in which all the
user requirements (performance, power, performance-to-power ratio) for the
individual applications are satisfied. The players of the game are the various
architecture parameters. This is a case of cooperative Game Theory, in which the
players of the game work together to achieve a common goal: reaching an
architecture state where the requirements (performance, power, etc.) are met.
The parameters of CUBEMACH are intricately interrelated. Selecting one
parameter alone might not show an improvement in performance unless its related
parameters are also selected and their values modified. For example, increasing
the number of ALFUs does not improve performance unless the scheduler is also
made powerful enough to deliver instructions at a rate that caters to the
increase in ALFUs. Also, when one player (an architecture parameter) is selected
to play the game (achieving the multiple objectives)
in that iteration, there may be another player (parameter) interacting with the
first-mentioned player which could give better results. The Game Theory engine
ensures that an appropriate subset of the parameters is chosen that helps in
quickly achieving the goal of the game. The Game Theory engine also has an
arbitrator, which ensures that the amount by which a parameter's value is
changed stays within acceptable limits, so that an unrealizable architecture
does not result at the end of the optimization process. Another purpose of the
arbitrator is to ensure that the values of the parameters do not conflict with
each other.
The implementation is as follows. Every parameter (a player in the game) is
initially assigned some rank. The Game Theory engine selects a subset of the
parameters, based on their ranks and their dependencies, whose values are
changed by some acceptable amount (ensured by the arbitrator). The resulting
architecture is simulated. If there is an appreciable improvement in performance
compared to the previous architecture state, then the selected subset of
parameters has contributed positively and their ranks are increased; if there is
not much change in performance, their ranks are decreased. In every iteration,
the Game Theory engine selects parameters with higher ranks, which are expected
to significantly affect the performance of the architecture. When the subset of
parameters is selected, those parameters closely related to them are also
included in the subset. This is achieved using an analytical model of the
CUBEMACH architecture, which captures the dependencies across parameters. A
variation of this implementation assigns an initial rank to the various teams of
the game. Each team is a subset of the entire parameter set and contains
parameters closely related to each other; the parameters within a team are also
assigned ranks. In each iteration, some teams are selected to play the game, and
from the selected teams, a subset
of players is selected every iteration based on the degree of closeness across
parameters. The rank of a player within a team shows how much that player
affects the optimization process within the team. So, in every iteration, the
Game Theory engine selects the teams that best affect the optimization process,
and from those teams, closely related players are selected whose values are
changed to determine the neighbour state of simulated annealing, as explained in
the next section. Thus the Game Theory engine within the optimization engine
ensures that only the values of those parameters expected to significantly
affect the performance are changed.
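The rank-based selection described above can be sketched as follows. The parameter names, the dependency map, and the rank-update rule are illustrative assumptions, not the actual CUBEMACH parameter set or heuristics.

```python
import random

# Illustrative rank table and dependency map (assumed parameter names).
ranks = {"num_alfus": 1.0, "sched_rate": 1.0, "onnet_width": 1.0}
related = {"num_alfus": ["sched_rate"], "sched_rate": [], "onnet_width": []}

def select_subset(k=1):
    # Prefer higher-ranked parameters via rank-weighted random choice ...
    chosen = set(random.choices(list(ranks), weights=list(ranks.values()), k=k))
    # ... then pull in closely related parameters (from the analytical model).
    for p in list(chosen):
        chosen.update(related[p])
    return chosen

def update_ranks(subset, old_perf, new_perf, step=0.1):
    # Reward subsets that improved performance; penalise those that did not.
    for p in subset:
        if new_perf > old_perf:
            ranks[p] += step
        else:
            ranks[p] = max(step, ranks[p] - step)

random.seed(0)
subset = select_subset()
update_ranks(subset, old_perf=1.0, new_perf=1.2)
```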
2.3.2 Application of simulated annealing for choosing the optimal solution
Simulated annealing derives its name from the metallurgical process of
annealing, in which a metal is heated to a high temperature and cooled very
slowly in order to remove its defects. The heat causes the metal's ions to move
freely to states with higher energy, and the slow cooling lets them settle into
configurations with lower energy than the initial state; i.e., the structure of
the metal is reformed, first by heating it to a very high temperature and then
by cooling it slowly. Similarly, in simulated annealing we start at a high
temperature T, where even poor solutions (architecture states far from the
required solution) are often accepted, and as the temperature is slowly reduced,
increasingly only those solutions which are better than the state from the
previous iteration are accepted.
An architecture state assigns a value from the design space to every
architectural parameter. Every state has an energy Estate associated with it.
This energy is the value of an objective function, where the objective can be to
minimize power, maximize performance, or maximize the performance-to-power
ratio; the energy is defined such that minimizing it achieves the objective.
Here, we optimize the architecture with respect to the performance-to-power
ratio, and
hence the energy Estate is a measure of the performance-to-power ratio. In every
iteration, the values of certain parameters from the previous iteration are
changed, resulting in a neighbour state with an associated energy Estate+1. The
parameters whose values are to be changed are selected by applying Game Theory,
as discussed in the previous section. ∆E = Estate+1 - Estate is the energy
difference between the two states. If ∆E is less than 0, we have a better
solution and the neighbour state is accepted. However, a worse solution is also
accepted sometimes, in order to ensure that the process does not get stuck in a
local minimum. To determine which worse solutions can be accepted, we calculate
two quantities P1 and P2 as shown below.
P1 = e^(−∆E/kT)
P2 = random(0, 1)

where k is the Boltzmann constant and T is the temperature of the process.
When P1 > P2, the worse solution is accepted. As T is slowly reduced over
time, the value of ∆E/kT generally increases and hence the value of P1
decreases; consequently, only very few poor solutions are accepted in the later
stages of the simulated annealing process.
The rate at which T is reduced is important; it is determined by a process
known as temperature scheduling, on which the success or failure of the
simulated annealing process very much depends. If the temperature is reduced
from an initially very high value Tstart to 0 K over infinite iterations, it is
assured that a global minimum is reached; proofs can be found in [7] and [6].
However, it is not practical to have infinite iterations.
If the cooling takes place at a fast rate, there is no guarantee for the process
to converge to a global minimum, and the process becomes simulated quenching.
The temperature has to be lowered at a rate such that convergence to a global
minimum is still possible. The simulated annealing process is continued until
the user requirements are met. If an unreasonably high goal (in this case, a
performance-to-power ratio) is set which cannot be achieved, the process
continues until a global minimum is reached or, in other words, until the
solution which comes closest to meeting the user's objectives and requirements
is reached.
S ← Initial State S0
T ← Initial Temperature T0
while (termination criterion is not met) and (T != Tend) do
    S' ← Neighbour of S
    ∆E ← E(S') − E(S)
    if ∆E < 0
        S ← S'
    else
        P1 ← e^(−∆E/kT)
        P2 ← random(0, 1)
        if P1 > P2
            S ← S'
    T ← UpdateTemperature()
end
Output S

Figure 2.2: General Simulated Annealing
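An executable rendering of this scheme is sketched below, with a toy one-dimensional energy function standing in for the simulated architecture and the Boltzmann constant folded into the temperature.

```python
import math
import random

def simulated_annealing(energy, neighbour, s0, t0, t_end=1e-3, alpha=0.95):
    """Minimise `energy` following the scheme of Figure 2.2."""
    s, t = s0, t0
    while t > t_end:
        s_new = neighbour(s)
        delta_e = energy(s_new) - energy(s)
        if delta_e < 0:
            s = s_new                        # better solution: always accept
        elif math.exp(-delta_e / t) > random.random():
            s = s_new                        # occasionally accept a worse one
        t *= alpha                           # geometric temperature schedule
    return s

random.seed(0)
best = simulated_annealing(
    energy=lambda x: (x - 3.0) ** 2,         # toy objective, minimum at x = 3
    neighbour=lambda x: x + random.uniform(-0.5, 0.5),
    s0=0.0, t0=10.0)
```

The geometric schedule here is one common choice of temperature scheduling; as the text notes, too large a reduction per step turns the process into simulated quenching.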
Another slight variation can be used when the CUBEMACH architecture needs
to run multiple independent applications, belonging to a single user or to many
different users, without space sharing or time sharing. These applications can
have different requirements and different priorities. For example, application 1
may have a high performance requirement, while application 2 may be required to
consume as little power as possible during execution, even if it takes a long
time to finish. In this case the architecture has to reflect the requirements of
the multiple applications, so while calculating the
value of ∆E, we consider the energy difference for each application separately:
the energy difference for application 1 is based on performance, the energy
difference for application 2 is based on power, and so on. The weighted sum of
these per-application differences is taken as the value of ∆E. The weights
depend on the priorities of the applications, with higher weights applied to
applications with higher priorities. The priority of an application, in turn,
can be determined by the cost (money) the user shares, i.e., the fraction of the
money spent on building the whole CUBEMACH architecture that the user
contributes. This way the architecture favours applications with higher
priority, while still satisfying the individual requirements of the multiple
applications.
S ← Initial State S0
T ← Initial Temperature T0
while (termination criterion is not met) and (T != Tend) do
    S' ← Neighbour of S
    ∆E1 ← E1(S') − E1(S)
    ∆E2 ← E2(S') − E2(S)
    ...
    ∆En ← En(S') − En(S)
    ∆E ← w1·∆E1 + w2·∆E2 + … + wn·∆En
    if ∆E < 0
        S ← S'
    else
        P1 ← e^(−∆E/kT)
        P2 ← random(0, 1)
        if P1 > P2
            S ← S'
    T ← UpdateTemperature()
end
Output S

Figure 2.3: Multiple Objective Simulated Annealing [28]
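The weighted combination of per-application energy differences can be sketched as follows; the delta values and weights are illustrative numbers, not measured results.

```python
def weighted_delta(deltas, weights):
    """Combine per-application energy differences into a single ∆E."""
    assert len(deltas) == len(weights)
    return sum(w * d for w, d in zip(weights, deltas))

# Application 1 (higher priority, performance objective) improved by 3.0;
# application 2 (power objective) worsened slightly. The move is still
# accepted because the weighted sum favours the higher-priority application.
d_e = weighted_delta(deltas=[-3.0, 0.5], weights=[0.7, 0.3])
accept = d_e < 0
```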
2.3.3 Application of graph partitioning algorithm for heterogeneous core formation
The structure of ONNET (the communication backbone of CUBEMACH) is such
that the latency involved in transferring data from an ALFU in one population to
an ALFU in another population is greater than the latency involved in transfer of
data between ALFUs belonging to the same population. Therefore, if highly com-
municating ALFUs are in the same population, then the latency involved in transfer
of data from one ALFU to another is greatly reduced. Transfer of intermediate data
from one ALFU to another takes place often in higher end applications. So suitable
grouping of ALFUs greatly improves the performance of the system.
[Figure 2.4: Modelling Communication - nodes represent ALFUs (MATMUL, MATADD,
MATMUL, LUD) and edge weights give the bytes transferred between them (1.4 KB
and 0.3 KB in the example)]
The amount of data transferred between every pair of ALFUs, in bytes, is
available from the simulation results and can be used to group highly
communicating ALFUs suitably. The simulation results are used to construct a
graph-theoretic model of the communication pattern: the nodes of the graph are
the individual ALFUs in the architecture; an edge indicates that the two nodes
it connects communicate by transferring intermediate data; and the edge weight
is the amount of data transferred, in bytes, between the two nodes. The
Kernighan-Lin graph partitioning algorithm can be applied to this graph to form
sub-graphs whose external cost is minimal. Each sub-graph represents a
population; the internal cost (the sum of edge weights for edges within the
sub-graph) represents the amount of data transferred within the population, and
the external cost is the amount of data transferred across populations. By
minimizing the external
cost, we limit the number of bytes transferred across populations, thereby improving efficiency. Similarly, each population can itself be modelled as a node, with edges representing data transfer between populations and edge weights giving the amount of data transferred. The KL partitioning algorithm can then be applied to this second graph to group populations into cores.
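The external-cost computation and a simplified, greedy variant of the Kernighan-Lin swap can be sketched as follows. This is not the full KL algorithm (it omits KL's tentative swap sequences and gain bookkeeping) and the node names in the example are hypothetical, in the spirit of Figure 2.4.

```python
import itertools

def external_cost(edges, part_a):
    """KL external cost: total weight of the edges crossing the partition."""
    return sum(w for (u, v), w in edges.items()
               if (u in part_a) != (v in part_a))

def kl_bisect(nodes, edges, max_passes=10):
    """Greedy KL-style bisection: repeatedly apply the single swap that most
    reduces the external cost. A simplification of Kernighan-Lin, for
    illustration only."""
    half = len(nodes) // 2
    part_a, part_b = set(nodes[:half]), set(nodes[half:])
    for _ in range(max_passes):
        current = external_cost(edges, part_a)
        best_gain, best_pair = 0.0, None
        for u, v in itertools.product(part_a, part_b):
            trial = (part_a - {u}) | {v}          # swap u out, v in
            gain = current - external_cost(edges, trial)
            if gain > best_gain:
                best_gain, best_pair = gain, (u, v)
        if best_pair is None:
            break                                  # no swap improves the cut
        u, v = best_pair
        part_a = (part_a - {u}) | {v}
        part_b = (part_b - {v}) | {u}
    return part_a, part_b
```

On a hypothetical four-ALFU graph with two heavy 1.4 KB edges and two light 0.3 KB edges, a single swap pass groups the heavily communicating pairs and reduces the cut from 2.8 KB to 0.6 KB.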
2.3.4 Application of multiple temperature simulated annealing for the optimization process
Another approach is to use multiple temperature simulated annealing in place of single temperature simulated annealing. In this approach, we do not use the same acceptance probabilities for all the parameters. This work is to be taken up next; using multiple temperatures is expected to handle the large number of parameters of heterogeneous multi-cores better.
During the optimization process, some parameters may reach their optimal values much earlier than others. That is, some parameters are explored more than the rest: suboptimal values of these parameters have already been tried and accepted or rejected by the optimization technique, so at later iterations their final values are approximately known and many poor solutions involving them are rejected. However, some other parameters may not have been explored sufficiently in the initial stages. If the values of these parameters are changed at later stages of the optimization process, when the temperature is very low, any poor solution that results is immediately rejected. These parameters are therefore never sufficiently explored and their optimal values are never reached. This situation occurs frequently when the number of parameters is very high.
The approach proposed here is to use a different simulated annealing temperature for each team of parameters. Closely related parameters are grouped together as a team, and the acceptance probability for each team is computed from its own temperature. Even if some parameters are not selected early in the optimization process, their temperatures do not decrease while their teams remain unselected. This ensures that parameters selected later are also sufficiently explored.
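A minimal sketch of per-team temperatures, assuming three hypothetical teams and a geometric cooling factor (the team names, starting temperature, and cooling constant are all illustrative assumptions):

```python
import math
import random

# Hypothetical parameter teams of a CUBEMACH-style design space; each team
# keeps its own temperature and cools only when one of its members is perturbed.
temps = {"alfu_mix": 100.0, "scheduler": 100.0, "onnet": 100.0}
COOLING = 0.9  # assumed geometric cooling factor per perturbation

def accept(team, delta_e, k=1.0):
    """Metropolis acceptance test using the perturbed team's own temperature."""
    t = temps[team]
    temps[team] *= COOLING        # only the selected team's temperature drops
    if delta_e < 0:
        return True               # improvements are always accepted
    return random.random() < math.exp(-delta_e / (k * t))
```

Because an unperturbed team's temperature never changes, a parameter first touched late in the run is still evaluated at a high temperature and can be explored freely.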
CHAPTER III
WORK COMPLETED
3.1 Performance modelling of heterogeneous multi-cores
The analytical model for a CUBEMACH-based architecture [24] is as follows:
3.1.1 Analytical model of a heterogeneous core
The various parameters of the core include the number of functional units within
the cores and also their types. Some of the parameters of the core and node are
denoted as
Mp = Number of PCOS inside a node
Ma = Number of ALFUs under 1 SCOS
Ms = Number of SCOS under 1 PCOS
Total number of ALFUs in a core = Mp * Ma * Ms
The size of the population buffer is given by,
Population Buffer Size = No. of ALFUs in a population * (MAX(Latency of memory, Latency of ONNET) + Best case latency of ALFU)
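These two relations can be sketched directly. The function and argument names are ours, and the equation is read with the MAX term and the best-case ALFU latency summed inside the parentheses before being multiplied by the ALFU count:

```python
def total_alfus(mp, ms, ma):
    """Total ALFUs in a core: Mp PCOS x Ms SCOS per PCOS x Ma ALFUs per SCOS."""
    return mp * ms * ma

def population_buffer_size(alfus_per_population, mem_latency,
                           onnet_latency, best_case_alfu_latency):
    """Population buffer size per the equation above (in buffer entries)."""
    return alfus_per_population * (max(mem_latency, onnet_latency)
                                   + best_case_alfu_latency)
```

With the assumed values Mp = 2, Ms = 4, Ma = 8, a core holds 64 ALFUs.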
3.1.2 Analytical model of Compiler on Silicon (COS)
The parameters of the Compiler-On-Silicon include the sizes of the various tables in the COS [27], which determine the number of entries in those tables and thereby affect the scheduling rate, as well as the bandwidth of the SCOS, the PCOS etc. The equations which determine the bandwidth of the SCOS and PCOS, the sizes of the various tables, the sizes of the libraries etc. are as follows:
3.1.2.1 Analytical model of Secondary Compiler On Silicon (SCOS)
Suppose the required performance is T Teraops and the input contains all types of instructions from the Instruction Set Architecture. Assume that the architecture contains N types of ALFUs, each executing one type of instruction from the ISA; that there are a instructions to be executed on ALFUs of type 1, b instructions to be executed on ALFUs of type 2 and so on; and that there are ALFU1 ALFUs of type 1, ALFU2 ALFUs of type 2 and so on.
The various symbols used here are
C = Maximum number of Control Words that can be formed from 1 Sub-library
K = Maximum number of Sub-libraries input to 1 SCOS in 1 second
1 Library = Smax Sub-libraries
(i.e. 1 Library can consist of a maximum of Smax Sub-libraries)
Ni = Number of Instruction in a Control Word
No. of ops/ALFU, t = T / Total number of ALFUs = T/(Mp*Ms*Ma) ... (1)
No. of control words/ALFU = t/Ni control words per second per ALFU ... (2)
Peak BW of 1 SCOS = Input to Ma ALFUs/sec
= t * Ma / Ni control words per second [from (2)]
Peak BW of all SCOS = Peak BW of 1 SCOS * Ms
Expected BW of all SCOS = Sustained resource utilization of all ALFUs * Peak BW of SCOS ... (3)
Rate of Input to SCOS (in control words) = K*C ... (4)
Under ideal conditions, the rate of input to the SCOS, in terms of control words, will be equal to its output rate. So, from (2) and (4),
K*C = t*Ma / Ni
From above, K = t*Ma / (Ni * C) Sub Libraries ...(5)
The above equation gives the maximum number of sub-libraries that must be sched-
uled to the Secondary Compiler On Silicon (SCOS) under ideal conditions.
Processing capacity of SCOS = SCOS output rate / No. of instructions per sub-library
= (K*C*Ni) / (C*Ni)
= K
The processing capacity of the SCOS is in terms of sub-libraries, while the SCOS output rate is in terms of instructions.
The sub-library detail table holds one entry per scheduled sub-library, so its size is given by,
Sub-library Detail Table Size = K * SizeSDT
SizeSDT = [log2(No. of applications) + log2(No. of libraries/app) + log2(No. of sub-libraries/lib)]/8 + 2*log2(logical memory size) + 1
The instruction detail table should hold all the instructions of at least 1 sub-library.
So the minimum size of the instruction detail table is given by,
Instruction Detail Table Size = Maximum size of a sub-library * SizeIDT
= C * Ni * SizeIDT
SizeIDT in bytes = [ log2(No. of applications) + log2(No. of libraries/app) +
log2(No. of sub-libraries/lib) + log2(C*Ni ) + log2(No. of instructions in ISA)]/8+
4*log2(logical memory size) +2
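Equations (1), (5) and the two table sizes above can be collected into a small sizing sketch. All argument names and the sample values in the test below are illustrative assumptions, not design-point data from the thesis:

```python
import math

def scos_model(T, Mp, Ms, Ma, Ni, C,
               n_apps, libs_per_app, sublibs_per_lib,
               logical_mem_size, isa_size):
    """Sketch of the SCOS sizing equations with assumed inputs.

    T is the target performance in ops/sec; the returned table sizes are
    in bytes per the SizeSDT / SizeIDT expressions above."""
    t = T / (Mp * Ms * Ma)                       # ops per ALFU, eq. (1)
    K = t * Ma / (Ni * C)                        # sub-libraries per SCOS, eq. (5)
    lg = math.log2
    size_sdt = (lg(n_apps) + lg(libs_per_app) + lg(sublibs_per_lib)) / 8 \
               + 2 * lg(logical_mem_size) + 1
    size_idt = (lg(n_apps) + lg(libs_per_app) + lg(sublibs_per_lib)
                + lg(C * Ni) + lg(isa_size)) / 8 \
               + 4 * lg(logical_mem_size) + 2
    return {"K": K,
            "sdt_bytes": K * size_sdt,
            "idt_bytes": C * Ni * size_idt}
```

For instance, with the assumed values T = 64000, Mp = 2, Ms = 4, Ma = 8, Ni = 4 and C = 25, each ALFU handles t = 1000 ops/sec and K works out to 80 sub-libraries.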
3.1.2.2 Analytical model of Primary Compiler On Silicon (PCOS)
The equation describing the peak bandwidth of PCOS is given by
Peak BW of PCOS = Number of SCOS under a PCOS * Input to 1 SCOS
= Ms * K Sub libraries
= Ms * (t *Ma/(Ni*C)) [refer 5] ... (6)
Rate of Input to PCOS in terms of Libraries = Ms * K / Smax ... (7)
The above equation gives the maximum number of libraries which should be scheduled
to the PCOS under ideal conditions.
3.1.2.3 Analytical model of the scheduler
The functional Unit status table and the functional unit utilization table is used
by the schedulers to decide the ALFU to which an instruction must be scheduled to.
The functional Unit status table contains the status information of all the ALFUs
within the core. So the number of entries is equal to the number of ALFUs in that
core. The same applies for Functional Unit Utilization table as well.
Functional Unit Status table Size = No. of ALFUs /core * SizeFUS
= Mp*Ms*Ma * SizeFUS
Functional Unit Utilization table Size = No. of ALFUs / core * SizeFUU
= Mp*Ms*Ma * SizeFUU
SizeFUS in bytes = 4 + log2(Mp*Ms * Ma)
SizeFUU in bytes = 1 + log2(Mp*Ms*Ma)
The minimum size of the instruction buffer table is determined by the following equation:
Instruction Buffer Table Size = Scheduler latency * No. of ALFUs/core * SizeIBT
= Scheduler latency * Mp*Ms*Ma * SizeIBT
SizeIBT = [ log2(No. of applications) + log2(No. of libraries/app) + log2(No. of sub-
libraries/lib) + log2(C*Ni ) + log2(No. of instructions in ISA) + log2(Mp*Ms*Ma)]/8+
4*log2(logical memory size) +2
The size of the SCOS output buffer is given by,
SCOS o/p buffer size = Population buffer size * Number of ALFUs in a population
= Population buffer size * Ma
Primary Compiler On Silicon Scheduler
The size of the various tables required for the PCOS to schedule the sub-
libraries to the SCOS is given by
Library Address Table Size = No. of libraries scheduled to the PCOS * SizeLAT
Library status table size = No. of libraries scheduled to the PCOS * SizeLST
SizeLAT = [ log2(No. of applications) + log2(No. of libraries/app) + log2(No. of
sub-libraries/lib)]/8 + 2*log2(logical memory size)
SizeLST = [ log2(No. of applications) + log2(No. of libraries/app) + log2(No. of
sub-libraries/lib)]/8 +2
3.1.3 Analytical model of On Node Network (ONNET)
The parameters of the On-Node-Network include the number of buffer stages in the ONNET [26], the number of MIN stages, the number of ports etc. These parameters decide the bandwidth of the ONNET. When the bandwidth is low, the resources are under-utilized, as the functional units wait for the ONNET to deliver data instead of executing the instructions scheduled to them.
The various equations concerned with the On-Node-Network are as follows:
Peak Processing Capacity of SLR per Clock = Sum over i = 1 to No. of ALFU types in population of Ti * (Ki + Li) ... (3.1)
Where,
Ti = No. of ALFUs of type i in the population
Ki = Number of input bits to an ALFU of type i
Li = Number of output bits from an ALFU of type i + packetizing bits
Li = Ki + 2*log2(64)
where 2*log2(64) is the number of bits used for packetizing.
Peak Bandwidth = Processing Capacity / PdSLR
Where, PdSLR = Pipeline delay of the Sub Local Router
Resource utilization, R = No. of clock cycles in operation / Total number of clock cycles
Effective resource utilization, Re = [Sum over i = 1 to Total no. of clocks of (No. of instructions in pipe at Ci / Pipeline stages) * 100] / Total no. of cycles of execution
Expected bandwidth = Re * Peak BW of population
Peak BW = Q / Ps = [Sum over i = 1 to No. of ALFUs in population of (Ki + Li)] / Minimum(P1, P2, ...)
Here P1, P2, P3, P4 are the pipeline delays of the ALFUs in a population, the minimum of which is chosen as the pipeline delay of the sub local router.
Q = Total output from the Sub Local Router
For the maximum value, Q = Sum over i = 1 to No. of ALFUs of (No. of ports per ALFU * bytes per clock)
SLR Latency = MIN Latency + Packetization Latency + (No. of buffers * Buffer Delay)
MIN Latency = Switching Latency * No. of Stages
Packetization Latency = (Data Size / Packet Size) * time taken for a single packet
Packet Drop Probability = (1 − No. of buffers / No. of packets per clock) * 100
Maximum ALFU Data Size = Clock Ratio of MIN with ONNET * Buffer Bytes * Number of Ports
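The latency and drop-probability relations can be sketched as follows. Units are whatever the model's inputs use (cycles in the example below), and the clamp at zero in the drop probability is our addition, since the formula as written goes negative when the buffers outnumber the packets per clock:

```python
def slr_latency(switching_latency, min_stages, data_size, packet_size,
                time_per_packet, n_buffers, buffer_delay):
    """SLR latency = MIN latency + packetization latency + buffering delay."""
    min_latency = switching_latency * min_stages
    packetization = (data_size / packet_size) * time_per_packet
    return min_latency + packetization + n_buffers * buffer_delay

def packet_drop_probability(n_buffers, packets_per_clock):
    """Packet drop percentage; clamped at zero (our addition) for the case
    where buffers exceed the packets arriving per clock."""
    return max(0.0, (1 - n_buffers / packets_per_clock) * 100)
```

For example, with a 2-cycle switch over 3 MIN stages, a 64-byte transfer in 16-byte packets at 1 cycle each, and 4 buffers of 2 cycles, the SLR latency is 6 + 4 + 8 = 18 cycles.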
3.1.4 Analytical model for Bandwidth of memory for heterogeneous multi-cores
The bandwidth of memory is given by the following equation:
Bandwidth of Memory = Sum over i = 1 to Portmax of (W * R * Pr * Ph / Min pipeline delay)
Where,
Portmax = Maximum number of ports
R = Resource Utilization
Pr = Probability of read
Ph = Probability of hit
W = Size of a word in bytes
All these equations together would determine the performance of the CUBEMACH
design paradigm based architecture under ideal conditions.
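Since the summand in the memory bandwidth equation is identical for every port, the sum collapses to a product. A sketch, with argument names of our choosing:

```python
def memory_bandwidth(n_ports, word_bytes, resource_util,
                     p_read, p_hit, min_pipeline_delay):
    """Memory bandwidth model above; the per-port term is the same for
    every port, so the sum over Portmax ports becomes a product."""
    per_port = word_bytes * resource_util * p_read * p_hit / min_pipeline_delay
    return n_ports * per_port
```

With the assumed values of 4 ports, 8-byte words, 50% utilization, read and hit probabilities of 0.5 each, and a minimum pipeline delay of 2 cycles, the model gives 2 bytes per cycle.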
3.2 Power modelling of Algorithm Level Functional Units
The CUBEMACH architecture consists of ALFUs whose basic computational units are the MIP Cells [29][30]. MIP cells are logically equivalent to CMOS complex logic gates and their variants. Hence, it is sufficient to obtain models of the dynamic, static and short-circuit power of an individual MIP Cell and then extrapolate these models to estimate an ALFU's total power. The dynamic power dissipation is obtained by estimating the number of power-consuming transitions for each MIP Cell under consideration. The load capacitance of a MIP Cell is determined by the number of fan-out MIP Cells.
The analytical equation used to calculate the dynamic power dissipation is:
Pdyn = Cload * VDD^2 * Activity factor * Frequency of operation
where,
VDD = Supply voltage
Activity factor = number of power dissipating transitions / total number of transitions
Frequency of operation = ALFU clock rate
Cload = load capacitance, determined by the number of fan-out MIP Cells
To calculate the short-circuit power dissipation, each MIP Cell was modelled as a single equivalent inverter, and the input signals to the MIP cell were modelled as a single equivalent input signal [32]. The input rise time and the output fall time, which affect the transistor peak current, were assumed to be equal so that the power dissipation due to the direct path current is minimized [31].
The analytical expression used to estimate the direct path power is:
Pdp = (Cshort-circuit + Cload) * VDD^2 * Frequency of operation
Where,
Cshort-circuit = tsc * Ipeak / VDD
As mentioned before, modelling the static power dissipation is important in the design phase of the architecture. Static current in a CMOS device is due to the reverse-biased junctions of the transistors in the off state and also due to the sub-threshold current, as discussed before. The analytical expression used to calculate the static power is:
Pstatic = Istatic * VDD
Where,
Istatic = Ileakage + Isub-threshold + Iband-to-band-tunnelling
is the static current of the transistor.
The values of the leakage current, static current, threshold voltage and supply voltage for the 45 nm technology node were obtained from the ITRS. The total power dissipation is the sum of the dynamic, static and direct path dissipations:
Ptotal = Pdyn + Pstatic + Pdirect-path
With these equations, the total power dissipation of a single MIP Cell is obtained. From these values, the total power dissipation of the ALFUs is calculated using equations, one per ALFU type, which give the number of MIP cells of each type as a function of the problem size and the size of the data.
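A sketch of the total-power calculation for one MIP cell, following the three equations above. The short-circuit capacitance is taken as tsc*Ipeak/VDD, so that Pdp reduces to the familiar tsc*Ipeak*VDD*f form plus the load term; all numeric inputs in the test are placeholders rather than ITRS 45 nm data:

```python
def mip_cell_power(c_load, vdd, activity, freq, t_sc, i_peak, i_static):
    """Total power of one MIP cell: dynamic + direct-path + static terms."""
    p_dyn = c_load * vdd ** 2 * activity * freq
    c_sc = t_sc * i_peak / vdd           # equivalent short-circuit capacitance
    p_dp = (c_sc + c_load) * vdd ** 2 * freq
    p_static = i_static * vdd
    return p_dyn + p_dp + p_static
```

An ALFU's power then follows by multiplying per-cell figures by the MIP-cell counts given by the per-ALFU-type equations mentioned above.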
3.3 Optimization of heterogeneous multi-core architectures using single temperature simulated annealing
3.3.1 Experimental methodology
We implemented a clock-accurate full system simulator which simulates the sched-
uler, the functional units, the backbone network architecture and the memory. The
scheduler is implemented in the simulator as a hardware scheduler (such as in [16])
for scheduling the instructions to the various functional units. The functional units in the architecture include basic scalar units such as adders, multipliers and dividers; matrix units such as matrix adders, matrix multipliers, matrix invertors and Crout's units, all of which solve problems of size 2x2 and are built from the basic scalar units; and graph-theoretic units, including graph traversal units and sorters, which are also built from basic scalar units. A hierarchical network employing the omega Multistage Interconnection Network (MIN) structure was included in the simulator. The hierarchical network is used for data transfer within the core as well as across cores. A subset of the functional units within the core has a private L1 cache, and each core
has its own L2 cache, and the L3 cache is common to all the cores. The simulator
is clock driven. After all the units in the architecture receive the clock signal, they
finish their execution for that clock cycle. After all the units finish their execution
for that particular clock cycle, the next clock trigger signal is sent to all the units for
execution in the next clock cycle.
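The clocking discipline described above (every unit completes the current cycle before the next trigger is broadcast) reduces to a simple loop. The Unit class here is a minimal stand-in of our own, not the simulator's actual interface:

```python
class Unit:
    """Minimal stand-in for a simulated unit (ALFU, scheduler, router...)."""
    def __init__(self, name):
        self.name = name
        self.cycles_done = 0

    def tick(self):
        self.cycles_done += 1           # one clock cycle's worth of work

def run(units, n_cycles):
    """Clock-driven loop: every unit finishes the current cycle before the
    next clock trigger is broadcast, as in the simulator described above."""
    for _ in range(n_cycles):
        for u in units:                 # broadcast the clock trigger
            u.tick()                    # unit completes this clock cycle
```

Because no unit sees the next trigger until all units have finished the current cycle, the loop gives every unit a consistent view of the architecture state at each cycle boundary.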
The values of the various architecture parameters such as the number of parallel
schedulers, the kinds of the functional units in the architecture and the number of
functional units of each kind, the buffer sizes for the hierarchical network, the number
of ports in the MIN structure, the size of L1 cache, L2 cache and the L3 cache etc.
for each of the cores are written in configuration file. These values are read by the
clock-accurate simulator to start the optimization process. The architecture which
corresponds to the parameter values in the configuration file is simulated and the
results of the simulation such as the performance of the architecture, the resource
utilization of the functional units, data transferred across the cores etc. are obtained.
The optimizer updates the values of the various parameters in the configuration file according to the proposed technique. When the input specifications are met, the optimization process is stopped and the final parameter values are taken as the near-optimal solution.
For testing the optimizer, we generated our own complex workloads with high computation and communication complexity. A variety of algorithms were used in generating these workloads. Results are shown for two workloads: one is computationally intensive, making predominant use of matrix-based algorithms such as matrix multiplication, LU-decomposition and matrix inversion, whereas the other consists of algorithms such as sorting, graph traversal and convex hull. These algorithms are widely used in scientific and commercial applications and are representative of such workloads.
3.3.2 Results for single temperature simulated annealing process
The results are shown for two workloads. The first workload is composed mainly
of matrix based algorithms such as matrix multiplication, matrix inversion, LU-
Decomposition etc.
The second workload is composed mainly of graph based algorithms such as graph
traversal, sorting and convex-hull. The results shown are for 4-core architecture with
each core consisting of 32 functional units with the functional units operating at a
frequency of 800MHz.
[Figure: performance in GFLOPS (0 to 400) plotted against optimization iterations 1 to 37]
Figure 3.1: Performance of the architecture over the iterations of the optimization process - Matrix Based Algorithms
Figure 3.1 shows the performance variation over the iterations of the optimization
process for workload 1, which largely consists of matrix based algorithms. The input
specifications demanded a performance of 500 GOps from the architecture. A maxi-
mum performance of 375 GOps was achieved from the optimization process as shown
in Fig. 3.1 before the process terminated. The initial architecture had a resource
utilization of about 29% which increased to over 42% as a result of the optimization
process. One reason for the low resource utilization from the optimization process is
the high dependency in the instructions of the workload generated. Another reason
is that a single temperature based simulated annealing is not sufficient for optimizing
[Figure: average scheduling rate per thousand clock cycles (0 to 70) for the initial and optimized architectures, plotted over 1 million clock cycles]
Figure 3.2: Rate of scheduling for every 1000 clock cycles for initial architecture and final architecture - Matrix based Algorithms
such a heterogeneous architecture with very large number of parameters. For this rea-
son we propose the usage of multiple temperature simulated annealing to handle the
large number of parameters, which is to be taken up as future work. The scheduling rate for the initial architecture and the optimized architecture for workload 1 is shown in Fig. 3.2 for 1 million clock cycles. It can be seen from the graph (Fig. 3.2) that the scheduling rate has increased for the optimized architecture. This is because the number of matrix-based units has increased in the final optimized architecture, which can be seen from Figure 3.5.
Figure 3.3 shows the performance variation over the iterations of the optimization process for workload 2, which largely consists of graph-based algorithms. A performance of 475 GOps was achieved as a result of the optimization process, as shown in Figure 3.3, where the input specification was 750 GOps. For this workload, a resource utilization of 30% was achieved in the optimized architecture, up from 18% in the initial architecture. The low resource utilization is a direct consequence of the high level of dependency among the instructions in the workload and the fact that a single temperature based simulated annealing process is not sufficient for the large number of parameters in heterogeneous architectures. The scheduling rate has
[Figure: performance in GFLOPS (0 to 500) plotted against optimization iterations 1 to 59]
Figure 3.3: Performance of the architecture over the iterations of the optimization process - Graph Based Algorithms
[Figure: average scheduling rate per thousand clock cycles (0 to 50) for the initial and optimized architectures, plotted over 1 million clock cycles]
Figure 3.4: Rate of scheduling for every 1000 clock cycles for initial architecture and final architecture - Graph based Algorithms
increased to an average of 38 instructions per clock cycle, from an initial value of 22 instructions per clock cycle for the starting architecture (Fig. 3.4). This is mainly because of the increased availability of suitable functional units, which can be seen from Figure 3.5. Most instructions are dependent on the remaining, smaller set of instructions in the workload. This explains the scheduling rate of the optimized architecture, in which the number of instructions scheduled increases and decreases alternately: the dependent instructions are executed in parallel in large numbers after their parent
instructions finish execution in the case of workload 2.
[Figure: number of scalar, matrix and graph-theoretic units (0 to 140) in the initial and optimized architectures, for the matrix-based and graph-based workloads]
Figure 3.5: Number of units in the initial and optimized architecture - Matrix and Graph based algorithms
Figure 3.5 shows the number of functional units of each kind in the 4-core architecture. The functional units are classified as scalar units, matrix units and graph units. For workload 1, consisting of algorithms such as matrix multiplication, matrix inversion and LU-decomposition, the number of matrix units increased after the optimization process. For workload 2, consisting of algorithms such as convex hull, sorting and graph traversal, the number of graph units increased after the optimization process. This contributed to the increase in performance (Fig. 3.1 and Fig. 3.3).
Figure 3.6 shows the impact of core formation for a single iteration of the optimization process. The amount of data transferred across cores reduced after highly communicating functional units were grouped together during core formation. This reduced the communication latency, which contributed to the increase in performance. The impact of core formation is higher for workload 2, where there is a higher dependency among instructions and, as a result, more functional units have to communicate with each other. After the core formation process, these communications were localized within the core.
[Figure: data transferred across cores in MB (0 to 4000), before and after core formation, for the matrix-based and graph-based workloads]
Figure 3.6: Impact of core formation - Amount of data transferred before and after core formation by application of graph partitioning
CHAPTER IV
WORK TO BE COMPLETED
From the results obtained for single temperature simulated annealing, it can be seen that although there is a significant improvement in the performance of the heterogeneous architecture, the process terminates before the input specifications are met. Using single temperature simulated annealing, it is difficult to explore the large design space arising from the very high number of architecture parameters in heterogeneous multi-core architectures. We therefore adopt multiple temperature simulated annealing, with different temperatures for different components of the design space. Also, so far only performance has been considered in the optimization process. After the development of a power model for the entire architecture, power aspects are to be included in the optimization as well.
4.1 Power model of CUBEMACH based architectures
The power models of the various ALFUs have already been developed. Since all the units in the architecture are based on MIP cells, the number of MIP cells in the various functional blocks of the architecture can be expressed as equations, once the architecture of each functional block has been represented using MIP cells. Once a generalized equation relating the number of MIP cells of each kind to the various architecture parameters is established, the power consumed by an entire architecture based on the CUBEMACH design paradigm can be found from the number of MIP cells used in building it. The work to be done here is to represent every functional block of the architecture in terms of MIP cells and then establish equations which give the number of MIP cells used in the architecture, given the values of the various architecture parameters. This has already been done for the ALFUs; similar equations have to be formed for the other units.
4.2 Optimization of heterogeneous multi-core architectures using multiple temperature simulated annealing
The results shown in this proposal made use of single temperature simulated annealing as the global optimization process. The work to be done includes implementing multiple temperature simulated annealing, the advantages of which are described in Section 2.3.4. Closely related parameters have to be grouped together and a starting temperature assigned to each group. The temperature schedule and its governing equation have to be determined for each of the groups involved in the process. The temperature of each group will depend on the distribution of parameters in that group.
[Figure: parameters shown include the number of MatMul, MatAdd, MatInv and LUD units; the number of scheduler ports, the scheduler table size and the parallel scheduler units; the cache size, block size and memory ports; and the ONNET ports, buffer size, switching rate and MIN stages]
Figure 4.1: Multiple Temperature Simulated Annealing
Figure 4.1 shows a limited number of parameters of a heterogeneous multi-core architecture based on the CUBEMACH design paradigm. The temperature corresponding to each parameter is represented by a colour. As shown in the
diagram, different groups of closely related parameters have a common temperature associated with them. During the optimization process, each parameter is explored according to its temperature. This process is yet to be implemented and is expected to improve the performance obtainable from the resulting architecture.
REFERENCES
[1] S. Kirkpatrick et al., Optimization by Simulated Annealing, In Science 220, 1983.
[2] BW Kernighan, S Lin, An efficient heuristic procedure for partitioning graphs, In
Bell System Technical Journal, 1970.
[3] J. von Neumann et al., Theory of Games and Economic Behavior, 3rd ed. In New
York: Wiley, 1964.
[4] R. Kumar et al., Core architecture optimization for heterogeneous chip multipro-
cessors, In Proceedings of the 15th international conference on Parallel architec-
tures and compilation techniques, 2006.
[5] S. Kang et al., Magellan: a search and machine learning-based framework for
fast multi-core design space exploration and optimization, In Proceedings of the
conference on Design, automation and test in Europe, 2008.
[6] P. J. M. van Laarhoven and E. H. L. Aarts, Simulated Annealing: Theory and
Applications. Mathematics and Its Applications. In Springer, Kluwer Academic
Publishers, Norwell, 1987.
[7] Andreas Nolte and Rainer Schrader. A note on the finite time behaviour of sim-
ulated annealing. In Operations Research Proceedings, 1996.
[8] Ross P.E. , Why CPU Frequency Stalled, In IEEE Spectrum, March, 2008.
[9] Geer D., Chip makers turn to multicore processors, In Computer, Vol. 38, Issue 5, May 2005.
[10] In http://www.intel.com/products/processor/pentium_dual-core/index.htm
[11] Richard McDougall and James Laudon, Multi-core microprocessors are here, In
USENIX Magazine, Vol 31, October 2006
[12] Shah M. et al., UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC, In IEEE Asian Solid-State Circuits Conference, 2007.
[13] Le, H. Q. et al., IBM POWER6 microarchitecture, In IBM Journal of Research and Development, Vol. 51, Issue 6, Nov 2007.
[14] In http://www.intel.com/consumer/products/processors/corei7.htm
[15] Dorsey J. et al., An Integrated Quad-Core Opteron Processor, In IEEE Interna-
tional Solid-State Circuits Conference, 2007. ISSCC 2007.
[16] Kahle J. A. et al., Introduction to the Cell multiprocessor, In IBM Journal of
Research and Development , 2005
[17] T. Chen et al., Cell Broadband Engine Architecture and its first implementation-
A performance view, In IBM Journal of Research, 2007.
[18] M. Gschwind et al., Synergistic Processing in Cell’s Multicore Architecture, In
MICRO, 2006.
[19] In http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[20] Gupta, R.K. et al., Introducing core-based system design, In IEEE Design and
Test of Computers, 1997
[21] Ravindhiran Mukundrajan, Towards Modeling and Design Automation of Supercomputing Clusters: SCOC IP Cores and its application to biological neuronal network modeling, http://www.warftindia.org/Thesis/Vishwakarma/Ravi_Thesis.pdf, Thesis submitted to WARFT.
[22] R. Kumar et al., Single-ISA Heterogeneous Multi-Core Architectures for Mul-
tithreaded Workload Performance, In ACM SIGARCH Computer Architecture
News, 2004.
[23] Venkateswaran N et al., On the concept of simultaneous execution of multiple
applications on hierarchically based cluster and the silicon operating system, In
IEEE International Symposium on Parallel and Distributed Processing, 2008
[24] Venkateswaran N et al., Custom Built Heterogeneous Multi-core Architectures
(CUBEMACH): Breaking the conventions, In IPDPSW, 2010
[25] Shyamsundar G., Homogeneously structured heterogeneous functional cores for SuperComputer on a Chip: MIP Paradigm based Design Space, http://www.warftindia.org/Thesis/Vishwakarma/ShyamSep2007.pdf, Thesis submitted to WARFT.
[26] Balaji Subramaniam, Towards Modeling And Integrated Design Automation Of Supercomputing Clusters (MIDAS): On Chip Networks for SCOC, Thesis submitted to WAran Research FoundaTion (WARFT).
[27] Aravind Vasudevan, Towards Modeling And Integrated Design Automation Of Supercomputing Clusters (MIDAS): Hierarchical Compiler On Silicon, Thesis submitted to WAran Research FoundaTion (WARFT).
[28] Piotr Czyzak, Andrzej Jaszkiewicz, Pareto simulated annealing: a metaheuristic technique for multiple-objective combinatorial optimization, In Journal of Multi-Criteria Decision Analysis, Jan 1998.
[29] Venkateswaran N et al., Memory in processor: A novel design paradigm for
supercomputing architectures, in ACM SIGARCH Computer Architecture News
archive Volume 32, Issue 3, 2004.
[30] Venkateswaran N et al., Memory In Processor- Supercomputer On a Chip:
Processor Design and Execution Semantics for Massive Single-Chip Perfor-
mance, in 19th IEEE International Parallel and Distributed Processing Symposium
(IPDPS05)
[31] H.J.M Veendrick, Short-circuit dissipation of static CMOS circuitry and its im-
pact on design of buffer circuits. In IEEE journal of solid state circuits, August
1984.
[32] Wang, Q. et al., A new short circuit power model for complex CMOS gates, In
Proceedings IEEE Alessandro Volta Memorial Workshop, 1999