
Algorithm and Architecture-level Design Space Exploration Using Hierarchical Data Flows

Helvio P. Peixoto and Margarida F. Jacome
ECE Dept., The University of Texas at Austin, Austin, TX 78712
{peixoto|jacome}@ece.utexas.edu

11th International Conference on Application-specific Systems, Architectures and Processors (ASAP-97), Zurich, 1997

Abstract

Incorporating algorithm and architecture level design space exploration in the early phases of the design process can have a dramatic impact on the area, speed, and power consumption of the resulting systems. This paper proposes a framework for supporting system-level design space exploration and discusses the three fundamental issues involved in effectively supporting such an early design space exploration: definition of an adequate level of abstraction; definition of good-fidelity system-level metrics; and definition of mechanisms for automating the exploration process. The first issue, the definition of an adequate level of abstraction, is then addressed in detail. Specifically, an algorithm-level model, an architecture-level model, and a set of operations on these models are proposed, aiming at efficiently supporting an early, aggressive system-level design space exploration. A discussion of work in progress on the other two topics, metrics and automation, concludes the paper.

    1 Introduction

The ability to perform a thorough algorithm and architecture level design space exploration in the early phases of the design process can have a dramatic impact (orders of magnitude!) on the area, speed, and power consumption of a system. [1][2] Gains in performance achieved by parallelizing specific code segments, while taking into consideration the communication overhead introduced by such parallelism, are among the area-speed trade-offs that can be explored at the algorithmic level. Taking advantage of the different execution speeds of hardware and software, and/or pipelining computations, are among the strategies that can be pursued at the architectural level in order to find solutions that cost-effectively meet the system's constraints. For the growing portable consumer electronics market, power consumption has also become a critical design concern. [1][2] As with area and speed, much can be done at the algorithmic and architectural levels to minimize the system's average power consumption, as discussed in [1]-[4]. For instance, in a typical signal processing application, the ratio of the critical path to the sampling rate indicates the potential for power reduction by lowering the voltage while still meeting the throughput requirements. Trade-offs between area and power can be achieved by using parallelism for voltage scaling. By identifying spatially local sub-structures, i.e., isolated, strongly connected clusters of operations within an algorithm, one can partition the design so as to minimize global buses.

Unfortunately, as the system size increases, so does the design space -- as a consequence, algorithm and architecture-level design space exploration has been, so far, quite limited in practice. [5]-[12] This paper describes a model that supports efficient algorithm and architecture level design space exploration of complex systems subject to timing, area, and power constraints. The goal of this early exploration is to identify, in a reasonable amount of time, one or more regions of the design space that potentially hold a cost-effective solution, given an algorithmic level functional description and a set of constraints and optimization goals. A design space region is defined by a broad architectural specification, which consists of: (1) a partition of the system's algorithmic description into a set of algorithmic segments; and (2) the definition of a set of hardware and/or software architectural components and interfaces for implementing these algorithmic segments. During this early, aggressive design space exploration, only system level design issues are considered -- the resulting broad architectural specification(s) should thus be seen as inputs to the next steps of the top-down design process.

Figure 1 summarizes our approach to early, system-level design space exploration. The search for a cost-effective system-level architecture is an iterative process in which algorithm and architecture level design space exploration are synergistically combined, as briefly discussed in the sequel. Two types of models are maintained: (1) algorithmic models that represent the system behavior; and (2) architectural models that represent possible architectures for implementing the system behavior. At the algorithmic level, timing, area, and power budgets are assigned to the various computations defined in the model. Such assignments are based on: (1) the constraints and optimization goals stated in the system specification; and (2) actual measures of the relative complexity of these computations (considering a particular implementation). Behavior preserving algorithmic transformations, such as parallelization of code segments or loop unrolling, are then applied during the algorithmic exploration, and their impact on the budgets is assessed. At the architectural level, different partitions and mappings of the system behavior onto architectural components (hardware or software) are explored. A library of components, properly characterized for this purpose, defines the alternatives that are viable for implementing the particular system. Rough estimates of the performance, area, and power of a particular architectural solution are then compared with the algorithmic budgets, in order to assess the solution's potential for meeting the specified constraints.
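As an aside, the overall search loop can be sketched in code. The sketch below is hypothetical Java (the language of the Sky prototype described in Section 5); every type and method name is ours, not an actual tool API, and the convergence test is deliberately simplified to a budget/estimate comparison.

    // Hypothetical skeleton of the iterative exploration of Figure 1.
    // All names (AlgorithmicModel, ArchitecturalModel, ...) are ours.
    final class ExplorationLoop {
        record Budgets(double delay, double power) {}
        record Estimates(double delay, double power) {}

        interface AlgorithmicModel {
            AlgorithmicModel transform();   // behavior-preserving transformation
            Budgets budgets();              // budgets derived from the specification
        }
        interface ArchitecturalModel {
            ArchitecturalModel explore();   // repartition / remap onto components
            Estimates estimate();           // rough delay/power estimates
        }

        static boolean meets(Estimates e, Budgets b) {
            return e.delay() <= b.delay() && e.power() <= b.power();
        }

        static ArchitecturalModel search(AlgorithmicModel alg, ArchitecturalModel arch,
                                         int maxIterations) {
            for (int i = 0; i < maxIterations; i++) {
                if (meets(arch.estimate(), alg.budgets())) return arch; // promising region
                alg = alg.transform();    // algorithm-level move
                arch = arch.explore();    // architecture-level move
            }
            return null;                  // nothing promising within the iteration bound
        }
    }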

The remainder of this paper is organized as follows. Sections 2 and 3 describe the proposed algorithmic-level and architecture-level models, respectively, and identify the set of basic operations that should be implemented on these models in order to effectively support system-level design space exploration. Section 4 discusses the importance of deriving good fidelity system-level metrics for individual and comparative assessment of competing architectural solutions, in order to adequately guide the design space exploration. Some conclusions and a discussion of work in progress are given in Section 5.

Figure 1: System-level design space exploration. [Figure not reproduced: the system specification drives two coupled loops -- algorithm-level design space exploration, operating on the algorithmic model and its budgets and drawing on a library of algorithms, and architecture-level design space exploration, operating on the architectural model and its estimates and drawing on a library of components -- which together produce an architectural specification identifying promising design space regions.]

    2 Algorithm Level of Abstraction

    2.1 Algorithmic Representation

In our approach, the specification of a system comprises: (1) an algorithm-level behavioral description (written in C or VHDL), annotated with probabilistic bounds on data dependent loop indices and with transition probabilities on alternative control paths (defined by conditional constructs); (2) a set of timing constraints; and (3) a set of optimization goals on area and/or power.

The algorithmic model is constructed by compiling the input specification into a set of hierarchical data flow graphs [13]-[16], where nodes represent computations and edges represent data dependencies. The resulting model is hierarchical since basic blocks, defined by alternative (flow of) control paths, are first represented as single complex nodes. Such complex nodes are then recursively decomposed (i.e., expanded) into sub-graphs, containing less complex nodes, until the lowest level of granularity (i.e., the single operation level) is reached. These last nodes are designated atomic nodes. A complex node is thus a contraction of nodes of lower complexity (i.e., smaller granularity). Complex nodes contain key information about their hierarchically dependent sub-graphs, such as their critical path. Edges contain information on the data (tokens) produced and consumed by the various nodes in the graph, including the specification of their type and size. Such information is essential for determining the specific physical resources required by each atomic node. (For example, we may have some nodes performing floating-point multiplications while other nodes perform just integer multiplications.) The graphs G = (V, E) in the model are polar, i.e., they have the beginning and the end of their execution synchronized by a source and a sink node, respectively. [15] Sink and source nodes have no computational cost; they are used simply to assert that graphs contracted into complex nodes represent basic blocks defined by alternative control flow paths.

In our proposed algorithmic model, hierarchy is represented by containment -- in other words, each graph may be considered at its higher level of abstraction or may have some of its complex nodes expanded into their corresponding subgraphs. Thereby, the level of granularity of the model (and thus its size and complexity) can be dynamically adjusted, according to the needs of the design space exploration.

Table 1 summarizes the various types of nodes supported in the algorithmic model and the firing rules defined for these node types. [13][14] Three types of firing rules are defined: input-conjunctive/output-conjunctive, input-conjunctive/output-disjunctive, and input-disjunctive/output-conjunctive. Nodes with input-conjunctive/output-conjunctive firing rules become enabled for execution only when all incoming edges have data available (i.e., have tokens present), and after executing, they produce data (tokens) on all output edges. The source, sink, basic operation, and basic block nodes shown in Table 1 have input-conjunctive/output-conjunctive firing rules. Nodes with input-conjunctive/output-disjunctive firing rules, the second type, become enabled for execution only when all incoming edges have data available (i.e., tokens) but, differently from the previous case, they produce data on only a single output edge. Nodes with execution semantics based on this firing rule are denoted select since, as a result of their execution, a specific (alternative) control flow path is selected. Finally, nodes with input-disjunctive/output-conjunctive firing rules become enabled for execution as soon as data is available on one of their input arcs, and then produce data on all output edges, as in the first case. Nodes with execution semantics based on this firing rule are denoted merge since they mark the end of a set of alternative control flow paths defined by a previous select node.

Type of Node      Granularity   Firing Rules
Source            Atomic        input-conjunctive/output-conjunctive
Sink              Atomic        input-conjunctive/output-conjunctive
Merge             Atomic        input-disjunctive/output-conjunctive
Basic Operation   Atomic        input-conjunctive/output-conjunctive
Basic Block       Complex       input-conjunctive/output-conjunctive
Select            Complex       input-conjunctive/output-disjunctive

Table 1: Classification of Nodes Supported in the Algorithmic Model



Merge and select nodes are thus used, in combination, to model the loops and conditionals that may be present in a behavioral description, as illustrated in the example that follows.
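Before turning to the example, the three firing rules can be written down compactly. The sketch below is ours (the paper defines the semantics, not this code); tokens are modeled as per-edge counts, and the choice of output edge for a select is taken as an input.

    import java.util.Arrays;
    import java.util.List;

    // Minimal sketch of the three firing rules of Table 1 (our own names).
    final class FiringRules {
        enum Rule { CONJ_CONJ, CONJ_DISJ, DISJ_CONJ }  // input/output sides

        // A node is enabled when its inputs satisfy the rule's input side.
        static boolean enabled(Rule rule, List<Integer> tokensOnInputs) {
            return switch (rule) {
                case CONJ_CONJ, CONJ_DISJ ->           // all inputs must carry a token
                    tokensOnInputs.stream().allMatch(t -> t > 0);
                case DISJ_CONJ ->                      // merge: any one input suffices
                    tokensOnInputs.stream().anyMatch(t -> t > 0);
            };
        }

        // Conjunctive outputs produce on every edge; a select (disjunctive
        // output) produces on exactly one chosen edge.
        static boolean[] fire(Rule rule, int outEdges, int chosen) {
            boolean[] produced = new boolean[outEdges];
            if (rule == Rule.CONJ_DISJ) produced[chosen] = true;  // select
            else Arrays.fill(produced, true);                     // all other node types
            return produced;
        }
    }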

Consider the small segment of the QRS behavioral description [19] (QRS is an algorithm used for monitoring heart rate in ECG applications) and its corresponding algorithmic model, shown in Figure 2. To facilitate the discussion, the hierarchy of basic blocks defined in procedure p is identified by the dotted boxes shown in the figure. Procedure p is thus composed of a single assignment instruction followed by a basic block, the latter abstracting the entire while statement. The corresponding top level graph, denoted G in the figure, is thus composed of four nodes: a basic operation node (for the assignment), a basic block node (abstracting the while statement), and the required pair of source and sink nodes. The only complex node in G (the while basic block node) is shown expanded in graph G1 -- note that the condition and the body of the while statement are represented by the select and the basic block nodes shown in G1. Their expansion results in graphs G2 and G3, respectively. Finally, graph G4 represents the if statement marked by the innermost dotted box in Figure 2. The select node, representing the if clause, is shown expanded in graph G5, and the basic block nodes defined by the then and the else parts of the statement are shown expanded in G6 and G7, respectively.

Figure 2: Example: algorithmic description and corresponding algorithmic model. [Figure not reproduced: it shows the code of procedure p -- an assignment i := 1 followed by a while loop whose body updates ymax from ys and increments i -- together with its hierarchical expansion into the graphs G, G1, ..., G7 discussed above.]

As mentioned above, the algorithmic description provided in the system specification is required to be annotated with profiling information providing: (1) upper bounds and mean values for data dependent loop indices; and (2) relative frequencies of execution for basic blocks contained in alternative control flow paths (defined by if-then-else or switch conditional statements). This information is important when, in the presence of data dependent constructs, one still wants to be able to reason about (average and worst case) execution delays and power consumption. (Analysis of probabilistic constraints can also be performed using the proposed algorithmic model, in addition to the worst case analysis discussed next; see [21] for a detailed discussion of algorithms for performing probabilistic constraint analysis using this model.) Accordingly, based on the profiling information referred to above, a function is defined which associates a transition probability with every edge sourcing from a select node or sinking into a merge node -- such a function indicates the probability that the particular node will consume/produce tokens from/to the specific edge. Obviously, the transition probabilities of all arcs sourcing (sinking) from a select (merge) node have to sum to one.

This function is used to estimate the average and the worst case execution delays and power consumption for all nodes in the model hierarchy. The example that follows illustrates how average and worst case execution delays are estimated. Consider the graph G4 shown in Figure 2. Assume that the execution delay of every atomic node is 1 reference operation (the reference operation is a normalized measure used in the components library), except for the sink, source, merge, and no-op nodes, which are assumed to have zero execution delay. If the transition probabilities for edges u and v are, for example, p(u) = 0.8 and p(v) = 0.2, then the average execution delay of G4 (and thus the average execution delay of the corresponding contracted node, shown in G3) is 1.8 reference operations. The critical path (or worst case execution delay) is defined, instead, as the longest path from source to sink. Since G4 contains two alternative control flow paths, its critical path is the longest path from source to sink either through edge u or through edge v. (Considering the execution delays provided in this example, the actual critical path of G4 is 2 reference operations, and is defined through edge u.) In the case of graphs directly expanding loop statements, such as G1, transition probabilities are also defined (for the corresponding merge and select nodes) based on the upper bounds and on the mean values provided for the corresponding loop indices -- these transition probabilities are then similarly used to determine the worst case and the average execution delays for the particular loop nodes.
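The numbers above are easy to check mechanically. In the sketch below, the two source-to-sink path delays of G4 (2 and 1 reference operations) follow from the stated totals; the finer split of the u path into a 1-operation select plus a 1-operation then block is our assumption.

    // Worked check of the G4 example: average = p(u)*2 + p(v)*1 = 1.8,
    // worst case (critical path) = 2 reference operations, through u.
    final class G4Delays {
        public static void main(String[] args) {
            // Two source-to-sink paths of G4 (delays in reference operations):
            //   through u: 2 (assumed split: select 1 + then-block 1; merge/sink 0)
            //   through v: 1 (assumed split: select 1 + no-op else 0)
            double[] pathDelay = {2.0, 1.0};
            double[] pathProb  = {0.8, 0.2};   // p(u), p(v) from the text

            double average = 0.0, worst = 0.0;
            for (int i = 0; i < pathDelay.length; i++) {
                average += pathProb[i] * pathDelay[i];  // expectation over control paths
                worst = Math.max(worst, pathDelay[i]);  // longest source-to-sink path
            }
            System.out.printf("average = %.1f ref. ops%n", average); // 1.8
            System.out.printf("worst   = %.1f ref. ops%n", worst);   // 2.0 (through u)
        }
    }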

During architecture level design space exploration, the specific resources that will execute each atomic node defined in the algorithmic model are selected from the components library. This design decision provides basic information on execution delay and power consumption for all of the atomic nodes in the model. For example, a specific multiplier may be defined as having an execution delay of 3 reference operations (a normalized time measure) and as consuming 4 power units per reference operation (a normalized power measure), while another resource, say an adder, may take only 1 reference operation and consume a single power unit per reference operation. (Note that, for atomic nodes, the worst case and the average delay, and likewise the worst case and the average power consumption, are assumed to be the same -- at system level, it would be prohibitively expensive to consider otherwise.) Starting from those normalized measures for the atomic nodes, the hierarchy is then traversed, bottom-up, and the worst case and average execution delay and power consumption for all complex nodes in the model are computed, as discussed in the previous example. This permits a quick identification of (potential) constraint violations at any level of the hierarchy, by comparing these values with individual node budgets derived from the system specification, as discussed in the sequel.

The area and power constraints (given in the system specification) are translated into delay and power budgets for the various nodes in the algorithmic model. These budgets are defined top-down, considering the relative time complexity and the relative power requirements of the various nodes in the model, i.e., considering the execution delay and the power consumption values computed for these nodes during the previous step (recall that all measures were normalized to a reference operation). Hierarchy is exploited during the budgeting process, since the budgets for a complex node are distributed through the nodes contained in its expansion graph. Budgets thus establish upper bounds on worst case execution delays, average power consumption, etc., for each node in the hierarchy. Feasibility is assessed by comparing the node budgets with the actual values required by the current node implementation, which can be determined by resolving the reference operation. Nodes that violate constraints at any level of the hierarchy can thus be quickly identified, and then locally optimized.
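The budgeting rule just described, splitting a parent's budget over its expansion graph in proportion to relative complexity, can be sketched as follows; the proportional-split formula is our reading of the text, not a formula the paper states.

    import java.util.ArrayList;
    import java.util.List;

    // Top-down budget distribution over the node hierarchy (our sketch).
    final class Budgeting {
        static final class Node {
            final String name;
            final double normalizedDelay;   // computed bottom-up, in reference operations
            final List<Node> children = new ArrayList<>();  // expansion graph
            double delayBudget;             // assigned top-down
            Node(String name, double normalizedDelay) {
                this.name = name;
                this.normalizedDelay = normalizedDelay;
            }
        }

        // Distribute a parent's budget over its expansion in proportion to
        // the children's normalized delays (their relative complexity).
        static void distribute(Node node, double budget) {
            node.delayBudget = budget;
            double total = node.children.stream().mapToDouble(c -> c.normalizedDelay).sum();
            for (Node c : node.children)
                distribute(c, total > 0 ? budget * c.normalizedDelay / total : 0.0);
        }

        // A node flags a (potential) constraint violation when its measured
        // delay exceeds the budget it was assigned.
        static boolean violates(Node node) {
            return node.normalizedDelay > node.delayBudget;
        }
    }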



    2.2 Algorithmic Model: Basic Operations

Transformations at the algorithmic level are modifications to the graph structure that change the number, type, and/or precedence of nodes within the graph, while preserving its overall input/output behavior. Examples are algorithm substitution, algorithm parallelization, loop unrolling, common sub-expression elimination/replication, constant propagation, operation strength reduction, function in-lining, etc. Due to the hierarchical nature of the algorithmic model, it is possible to apply these transformations to functions of any size and complexity, to any other types of basic blocks, or even to individual operations, as in the case of operation strength reduction. For example, at the functional level, one may try a completely different implementation of a given algorithm or, instead, just try to parallelize a given block of operations, in order to reduce the critical path. In the example given in Figure 2, one could easily substitute the entire graph G1, i.e., the node that contracts it, with another node derived by applying loop unrolling and sub-expression elimination to the original code segment. Note that graph G would remain the same, as would all of the other graphs located above G1 in the hierarchy.
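To make the loop unrolling case concrete, the sketch below shows the transformation applied to a loop with the shape of the one in procedure p. The max-scan reading of the loop is our reconstruction of the partially garbled listing in Figure 2, so treat this as an illustration of the transformation, not of the exact QRS code.

    // Illustration of loop unrolling as a behavior-preserving transformation
    // on a loop shaped like the one in procedure p (our reconstruction).
    final class Unrolled {
        // Assumes ys[1..n] are valid entries (1-based, mirroring the pseudocode).
        static int maxAfterUnrolling(int[] ys, int n) {
            int ymax = Integer.MIN_VALUE;
            int i = 1;
            for (; i + 1 <= n; i += 2) {               // two comparisons per iteration
                if (ys[i] > ymax)     ymax = ys[i];
                if (ys[i + 1] > ymax) ymax = ys[i + 1];
            }
            if (i <= n && ys[i] > ymax) ymax = ys[i];  // cleanup when n is odd
            return ymax;
        }
    }

Unrolling halves the loop-control overhead (fewer select/merge firings in the graph) at the cost of a larger body, which is exactly the kind of area-speed trade-off the budgets are meant to arbitrate.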

    3 Architecture Level of Abstraction

    3.1 Architectural Representation and Component Libraries

As mentioned above, the design decisions made during architecture-level exploration are compiled into a broad architectural specification, which comprises: (1) a first-cut partitioning of the algorithmic model into a set of algorithmic segments, each of which is to be implemented by an individual architectural component; and (2) a set of fundamental design decisions on the implementation of these architectural components and their interfaces.

Architectural components are single ICs -- they thus define a space for resource sharing, and they also define an independent clock. Architectural components can be implemented in hardware, meaning that the corresponding architectural component will be an ASIC or an FPGA, or can alternatively be implemented in software, meaning that the corresponding architectural component will be realized by an off-the-shelf micro-controller, DSP, or general purpose processor. The decision to implement a given system component in hardware or in software is made by selecting a target component library that properly describes the required implementation technology. So far we have mostly concentrated on defining hardware libraries tailored to ASIC designs. In these libraries, components such as multipliers and adders are characterized in terms of their delay, area, power consumption, voltage supply, etc. Unless otherwise stated, the discussion from this point on focuses on hardware architectural components only.

The basic elements of the proposed architectural model are: architectural components, modules, and physical resources. In order to efficiently support architectural exploration, three additional elements were added to the algorithmic model: algorithmic segments, clusters of nodes, and pipeline stages. In this section we focus on precisely defining these various model elements; in Section 4 we discuss their use during the design space exploration.

As referred to above, the original algorithmic model is partitioned into a set of algorithmic segments -- an algorithmic segment thus comprises one or more (sub)graphs belonging to an algorithmic model. (Some transformations may need to be performed on nodes at the interface of a partition but, for simplicity, we omit them here.) Figure 3(a), for example, shows two algorithmic segments defined on a simple algorithmic model that contains only one top-level graph with five basic block nodes.

Architectural components are ICs that implement algorithmic segments using a given target library, i.e., a given universe of components (see Figure 4). The implementation of an algorithmic segment (by an architectural component) is defined as the implementation of the various clusters of nodes defined for the particular segment, as discussed next. A cluster of nodes is a set of one or more connected atomic nodes belonging to a graph within an algorithmic segment -- Figure 3(b), for example, shows three clusters defined within a particular graph. In order to completely define an architectural component, every atomic node within an algorithmic segment must thus belong to one and only one cluster. Clusters of nodes are implemented by modules -- a module consists of one or more physical resources (specifically, functional units), optimally interconnected for executing a specific cluster of nodes. (As will be discussed in Section 4, clusters and modules are used to quantify the potential for implementation overhead minimization.) If no scheduling conflicts exist, modules can be shared among isomorphic clusters. For example, Module 1 in Figure 3(b) implements a multiplication and is used by only a single cluster, while Module 2 implements an addition followed by a subtraction and is shared by two isomorphic clusters.
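The segment/cluster/module vocabulary maps naturally onto a small data model. The sketch below uses our own types, and a flat operation signature stands in for the sub-graph isomorphism test that module sharing actually requires; it reproduces the Figure 3(b) allocation of one multiply module and one shared add-sub module.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Sketch (our types): clusters of atomic nodes, grouped so that one
    // module serves every cluster in an isomorphism class.
    final class ClustersAndModules {
        record Cluster(String id, List<String> ops) {              // connected atomic nodes
            String signature() { return String.join(",", ops); }   // crude isomorphism key
        }
        record Module(String signature, List<Cluster> servedClusters) {}

        // One module per isomorphism class, shared by all clusters in the class.
        static List<Module> allocateModules(List<Cluster> clusters) {
            Map<String, List<Cluster>> classes =
                clusters.stream().collect(Collectors.groupingBy(Cluster::signature));
            return classes.entrySet().stream()
                .map(e -> new Module(e.getKey(), e.getValue()))
                .toList();
        }

        public static void main(String[] args) {
            // Figure 3(b): one multiply cluster, two isomorphic add-sub clusters.
            List<Module> modules = allocateModules(List.of(
                new Cluster("c1", List.of("*")),
                new Cluster("c2", List.of("+", "-")),
                new Cluster("c3", List.of("+", "-"))));
            modules.forEach(m -> System.out.println(
                m.signature() + " -> " + m.servedClusters().size() + " cluster(s)"));
        }
    }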

We conclude the discussion by introducing the notion of a pipeline stage. As mentioned in Section 2, the graphs that comprise an algorithmic description are polar, i.e., they have a source and a sink node that synchronize the beginning and the end of the execution of the graph. By default, each such graph is fully enclosed in what we call a pipeline stage -- specifically, this means that the source node can only be fired after the sink node has completed its execution. This default scheduling policy can be modified, though, by increasing the number of pipeline stages defined within a graph. Figure 3(c), for example, shows a graph where two pipeline stages were defined -- the nodes enclosed in each of the two stages shown in the figure can now execute concurrently with respect to each other.

Figure 3: Example of (a) algorithmic segments; (b) clusters and modules; and (c) pipeline stages. [Figure not reproduced: panel (a) shows a top-level graph with five basic block nodes, A through E, partitioned into Algorithmic Segments 1 and 2; panel (b) shows three clusters, with Module 1 implementing a multiplication for a single cluster and Module 2 implementing an addition followed by a subtraction, shared by two isomorphic clusters; panel (c) shows a graph divided into Pipeline Stages 1 and 2.]

    3.2 Architectural Model: Basic Operations

Operations at the architectural level are aimed at changing the structure or type of architectural components, clusters, modules, and pipeline stages. These basic operations include: (1) splitting and merging of algorithmic segments and/or modification of the target component library for a segment; (2) migration of node clusters across algorithmic segments, checking for invalid conditions due to resource sharing; (3) merging and splitting of clusters, checking for invalid conditions, and automatically generating new modules, when applicable; (4) definition of pipeline stages within a graph; (5) algorithm substitution, whereby blocks of statements are substituted by different, functionally equivalent blocks (this allows us, for instance, to experiment with different loop factorizations); and (6) substitution of individual physical resources within modules.
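Written down as an interface, the six operations might look as follows. This is a hypothetical signature set of ours, not Sky's API; it is only meant to make the scope of each operation explicit.

    // Hypothetical interface for the six architecture-level operations above.
    interface ArchitecturalOperations {
        void splitOrMergeSegments(String segmentA, String segmentB); // (1) incl. library retargeting
        void migrateCluster(String cluster, String toSegment);       // (2) checks sharing conflicts
        void mergeOrSplitClusters(String clusterA, String clusterB); // (3) may generate new modules
        void definePipelineStages(String graph, int stages);         // (4)
        void substituteAlgorithm(String block, String replacement);  // (5) functionally equivalent block
        void substituteResource(String module, String resource);     // (6) swap a functional unit
    }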

    4 Metrics and Design Space Exploration

Sections 2 and 3 discussed the level of abstraction at which the proposed system level design space exploration is to be conducted -- properly defining this level of abstraction is of paramount importance in order to enable an aggressive, early exploration of design spaces of such huge dimensions. A second, equally fundamental issue in supporting early design space exploration concerns the ability to evaluate the absolute and relative goodness of competing solutions, i.e., broad architectural specifications, with respect to the area, speed, and power constraints defined in the specification.

One of the fundamental difficulties involved in accurately estimating performance, area, and power consumption at any level of design abstraction other than the physical level is the ability to precisely account for the physical resources that are abstracted, i.e., not yet defined at the particular level. As suggested in [3][4], resources can be broadly classified into two categories: algorithm inherent and implementation overhead. Algorithm inherent resources are the functional units (multipliers, adders, etc.) needed to implement the operations defined in the system's algorithmic description -- such resources are tangible and can be reasoned upon at the system level of abstraction. Implementation overhead resources account for all of the other resources needed to correctly execute such operations, including control logic, steering logic (e.g., buses and MUXes), registers, and wiring. Implementation overhead resources are not yet defined at the system level of abstraction.

Unfortunately, implementation overhead can contribute significantly to the delay, power consumption, and area of a design. Thereby, ignoring implementation overhead, by performing trade-offs only in terms of functional units, may lead to sub-optimal design solutions. For example, at the architecture level of abstraction, one may attempt to trade off speed for area by increasing the level of resource sharing in a given design. However, due to the corresponding increase in control complexity and to a possible increase in the number and/or size of buses, etc., the overall area of the design may actually increase! Similarly, at the algorithm level of abstraction, one may attempt to reduce power consumption using, for example, operation strength reduction. A recent case study illustrates the results of applying one such transformation -- multiplications substituted by adds and shifts -- to fourteen Discrete Cosine Transform (DCT) algorithms. [17] Interestingly, in all cases the overall power consumption actually increased as a result of the transformation -- in half of the cases it increased by more than 40%! This was so because the power consumption in buses, registers, and control circuitry dramatically increased, in some cases by more than 400%, totally overshadowing the power savings in the functional units.

The central issue thus becomes being able to account for the relative impact of implementation overhead on the performance, area, and power consumption of competing alternative solutions without having hard information about these resources. In order to tackle this problem, we are working on an approach that combines:

(1) A set of metrics designed to rank competing solutions based on their potential to take advantage of specific properties that correlate with minimal implementation overhead. Examples of such properties are given below.

(2) A methodology for statistical characterization of implementation overhead area, delay, and power consumption, taking into consideration the previous measures (i.e., considering the degree to which the properties referred to above are taken advantage of by a particular solution). This second set of metrics, together with metrics that account for area, delay, and power consumption in algorithm inherent resources, is used for feasibility analysis, i.e., for performing a rough assessment of the ability of a solution to meet its area, delay, and power budgets. (Accounting for area, power consumption, and delay in algorithm inherent resources does not pose major difficulties -- all of the necessary information is provided in the corresponding target component library, as briefly discussed in Section 3.)

The properties referred to in point (1) constitute the essential mechanism in our approach for identifying good quality system level solutions, i.e., broad architectural specifications that have the potential to minimize implementation overhead. In what follows, we briefly discuss three such properties. (The metrics to be discussed below, and the precise definition of the statistical parameters to be used in (2), are still work in progress.)



    Locality of Computations

A group of computations within an algorithm is said to have a high degree of locality if the algorithm level representation of those computations corresponds to an isolated, strongly connected (sub)graph. [18] More informally, the larger the volume of data being transferred among the nodes belonging to the cluster, in comparison with the volume of data entering and exiting the cluster, the higher the degree of locality. (The volume of data can be determined using the information provided on the edges.) By indeed considering such strongly connected sub-graphs as single clusters of nodes, and thus implementing them using modules optimized for performing the specific set of computations and data transfers, it is possible to minimize the implementation overhead needed to execute the corresponding part of the algorithm. Namely, the required physical resources will be located in proximity to each other, thus minimizing the length of interconnect and/or buses. So, if an algorithm exhibits a good degree of locality of computations, i.e., has a number of isolated, strongly connected clusters of computation, such clusters define a clear way of organizing the functional units into modules so as to minimize the number of global buses, and possibly other implementation overhead resources, such as control and/or steering logic.
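The informal ratio above translates directly into code. The sketch below computes it from the data volumes annotated on edges; the exact formula (internal volume divided by boundary volume) is our formalization, since the paper's metrics are explicitly work in progress.

    import java.util.Set;

    // Sketch of a locality measure: data volume moved inside a cluster
    // relative to the volume crossing its boundary (our formulation).
    final class Locality {
        record Edge(String from, String to, int dataVolume) {}  // volume from edge annotations

        static double degreeOfLocality(Set<String> cluster, Iterable<Edge> edges) {
            long internal = 0, boundary = 0;
            for (Edge e : edges) {
                boolean fromIn = cluster.contains(e.from()), toIn = cluster.contains(e.to());
                if (fromIn && toIn) internal += e.dataVolume();       // stays inside the cluster
                else if (fromIn || toIn) boundary += e.dataVolume();  // enters or exits it
            }
            return boundary == 0 ? Double.POSITIVE_INFINITY          // perfectly isolated
                                 : (double) internal / boundary;     // higher = more local
        }
    }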

    Regularity

If a given algorithm, or algorithmic segment, exhibits a high degree of regularity, i.e., requires the repeated computation of certain patterns of operations, it is possible to take advantage of such regularity to minimize implementation overhead. Consider, for example, the Fast Fourier Transform (FFT) algorithm shown in Figure 4. [20] For the sake of the discussion, assume that the FFT is to be computed in two clock cycles, one for each stage -- two multipliers, two adders, and two subtracters are thus needed. Observe further that, as shown in Figure 4, clusters 1 and 3 are isomorphic sub-graphs, and clusters 2 and 4 are also isomorphic sub-graphs. It is thus possible to define two modules (each containing one multiplier, one adder, and one subtracter), and then use a single module for implementing all clusters within a given isomorphic group, as illustrated in Figure 4. Modules can thus be optimized for executing not only one, but a number of isomorphic clusters, thus leading to solutions that minimize the cost of resource sharing in terms of implementation overhead. We measure the degree of regularity of an algorithm by the number of isomorphic sub-graphs that can be identified in its corresponding algorithmic model.

Figure 4: Clusters, modules, and architecture components. [Figure not reproduced: a two-stage FFT graph (stage 1, stage 2) is partitioned into clusters 1 through 4; Module 1 and Module 2 each serve one isomorphic group of clusters, and together they form an architecture component implemented from a component library.]
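A regularity count can be prototyped cheaply by grouping clusters under a canonical signature, as sketched below. A flat signature is only a proxy: the two isomorphic groups of Figure 4 contain the same operation mix in different structures, so a faithful measure needs a real sub-graph isomorphism test.

    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    // Rough sketch: count clusters that participate in repeated patterns,
    // using a canonical signature as a cheap stand-in for isomorphism.
    final class Regularity {
        static int degreeOfRegularity(List<List<String>> clusters) {
            Map<String, Long> classes = clusters.stream()
                .map(ops -> String.join(",", ops))        // canonical signature
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
            return classes.values().stream()
                .filter(n -> n > 1)                       // pattern occurs more than once
                .mapToInt(Long::intValue)
                .sum();
        }

        public static void main(String[] args) {
            // Generic example: two isomorphic clusters plus one singleton -> 2.
            System.out.println(degreeOfRegularity(List.of(
                List.of("+", "-"), List.of("+", "-"), List.of("*"))));
        }
    }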

    Sharing Distance

This metric considers the distance between clusters of nodes that share the same module. The idea is to favor solutions that maintain a certain degree of locality in their module sharing policy, thus minimizing the need for global buses. Note that, since the distances referred to above are measured in time (i.e., in number of reference operations between two consecutive executions of the same module), this metric also favors solutions that concentrate the use of modules in specific time intervals, thus creating opportunities for clock gating.
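Sharing distance admits an equally small sketch, under our own (assumed) reading of it as the average gap, in reference operations, between consecutive scheduled executions of a module.

    import java.util.List;

    // Sketch of sharing distance: average gap between consecutive executions
    // of a module, measured in reference operations (our assumed definition).
    final class SharingDistance {
        static double averageSharingDistance(List<Integer> startTimesSorted) {
            if (startTimesSorted.size() < 2) return 0;  // module not actually shared
            long total = 0;
            for (int i = 1; i < startTimesSorted.size(); i++)
                total += startTimesSorted.get(i) - startTimesSorted.get(i - 1);
            return (double) total / (startTimesSorted.size() - 1);
        }

        public static void main(String[] args) {
            // Module executed at reference operations 0, 2, 4: average distance 2.0.
            System.out.println(averageSharingDistance(List.of(0, 2, 4)));
        }
    }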

Measures involving the properties referred to above should be carefully considered during design space exploration, as briefly discussed next. Locality of computations and regularity are intrinsic algorithmic properties -- they can thus be used to initially drive the algorithm level design space exploration process. Taking maximum advantage of these properties may require an aggressive architecture level design space exploration, though. For example, an obvious goal is to maintain strongly connected clusters of computation within the same architectural component. A less trivial task, though, is to attempt to define algorithmic segments (i.e., partitions) that maximize the degree of regularity within each component -- the search for alternative sub-graph groupings is too daunting a task to be performed manually by a system level designer. Such a strategy can unquestionably lead to orders of magnitude savings in implementation overhead and clearly illustrates the benefits of an aggressive, early design space exploration. In addition, within a given component, trade-offs between larger modules, optimizing large chunks of computations, and smaller modules, shared by a number of smaller isomorphic clusters, should be carefully considered. Obviously, the timing, power, and area budgets for the several parts of the algorithmic description must also be considered during the design space exploration, through feasibility analysis. For example, if a given computationally expensive loop has stringent timing constraints, its parallelization may need to be considered, in an attempt to increase execution speed, even if locality will most probably be compromised by such a parallelization. Due to the complex interplay between all of the possible transformations and architectural alternatives that can be considered, stochastic optimization will initially be used for the design space exploration.

We finish by contrasting the system level exploration of architectural components discussed in this paper with the detailed design of fixed components performed during behavioral synthesis. In our approach, only candidate partitions that exhibit the highest (comparative) ranks in terms of maximizing overall degrees of locality, regularity, etc., are selected for further architectural exploration. The decisions of assigning modules to components during architectural exploration are also very different in nature from the allocation and binding of physical resources performed by a behavioral synthesis tool. Module allocations are done exclusively with the purpose of: (1) identifying heuristically good points within the selected promising design space region -- this is done by exploring local trade-offs in terms of locality, regularity, sharing distance, etc.; and (2) quickly characterizing those alternative design points, i.e., estimating their area, performance, and power, based on the statistics referred to in the beginning of this section -- those results are then used to decide on the adequacy of the overall design space region with respect to the specification.

    5 Conclusions and Work in Progress

This paper proposes a framework, outlined in Figure 1, for supporting efficient algorithm and architecture level design space exploration of complex systems subject to timing, area, and power constraints. The fundamental issues involved in effectively supporting such an early design space exploration are: (1) defining the correct level of abstraction at which the exploration should be performed, i.e., identifying the truly system-level design issues; (2) being able to assess the quality of competing solutions, using good fidelity system-level metrics; and (3) being able to automatically explore the design space. Issues (2) and (3) were briefly discussed in Section 4. The main focus of this paper, though, was on the first issue: precisely defining the abstraction level at which early system-level design space exploration should be undertaken. Such a level of abstraction, defined by the models discussed in Sections 2 and 3, is of paramount importance if one wants to be able to support an aggressive and effective early exploration of the design space.

Sky, an environment for early design space exploration, is currently being prototyped in Java. Sky implements the algorithm-level and architecture-level models and operations discussed in Sections 2 and 3, and the (hardware) component libraries referred to in Section 3. The automatic computation and update of the timing/area/power budgets and the metrics are also being incorporated in Sky. A companion paper, focusing mostly on the second issue identified above, i.e., on system-level metrics, and reporting some of our preliminary results, will be submitted soon.

    6 Bibliography

[1] J. Rabaey and M. Pedram (eds.), Low Power Design Methodologies, Kluwer Academic Publishers, 1996.

[2] M. Pedram, "Power Minimization in IC Design: Principles and Applications," ACM Transactions on Design Automation of Electronic Systems, ACM Press, 1996.

[3] J. Rabaey, L. Guerra, and R. Mehra, "Design Guidance in the Power Dimension," Proceedings of ICASSP'95, May 1995.

[4] R. Mehra and J. Rabaey, "Behavioral Level Power Estimation and Exploration," Proceedings of the First International Workshop on Low Power Design, ACM Press, 1994.

[5] G. De Micheli and M. Sami (eds.), Hardware/Software Codesign, Kluwer Academic Publishers, 1996.

[6] D. Gajski, F. Vahid, S. Narayan, and J. Gong, Specification and Design of Embedded Systems, PTR Prentice Hall, 1994.

[7] R. Gupta and G. De Micheli, "Hardware-Software Cosynthesis for Digital Systems," IEEE Design & Test of Computers, 10(3), 1993.

[8] R. Gupta, Co-Synthesis of Hardware and Software for Digital Embedded Systems, Kluwer Academic Publishers, 1995.

[9] R. Ernst, J. Henkel, and T. Benner, "Hardware-Software Cosynthesis for Microcontrollers," IEEE Design & Test of Computers, 10(3), 1993.

[10] D. Gajski and F. Vahid, "Specification and Design of Embedded Hardware-Software Systems," IEEE Design & Test of Computers, 12(1), 1995.

[11] P. Knudsen and J. Madsen, "PACE: A Dynamic Programming Algorithm for Hardware/Software Partitioning," Proceedings of the Fourth International Workshop on Hardware/Software Codesign, ACM Press, 1996.

[12] B. Lin, S. Vercauteren, and H. De Man, "Embedded Architecture Co-Synthesis and System Integration," Proceedings of the Fourth International Workshop on Hardware/Software Codesign, ACM Press, 1996.

[13] K. M. Kavi, B. P. Buckles, and U. N. Bhat, "A Formal Definition of Data Flow Graph Models," IEEE Transactions on Computers, C-35(11), 1986.

[14] P. C. Treleaven, D. R. Brownbridge, and R. P. Hopkins, "Data Driven and Demand Driven Computer Architecture," ACM Computing Surveys, Mar. 1982.

[15] D. Ku and G. De Micheli, High-level Synthesis of ASICs under Timing and Synchronization Constraints, Kluwer Academic Publishers, 1992.

[16] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994.

[17] M. Potkonjak, K. Kim, and R. Karri, "Methodology for Behavioral Synthesis-based Algorithm-level Design Space Exploration: DCT Case Study," to appear in Proceedings of the 35th Design Automation Conference, ACM Press, June 1997.

[18] G. Schmidt and T. Strohlein, Relations and Graphs: Discrete Mathematics for Computer Scientists, Springer-Verlag, 1993.

[19] P. R. Panda and N. Dutt, "1995 High Level Synthesis Design Repository," Technical Report #95-04, University of California, Irvine, Feb. 1995.

[20] E. C. Ifeachor and B. W. Jervis, Digital Signal Processing: A Practical Approach, Addison-Wesley, 1993.

[21] G. de Veciana and M. Jacome, "Hierarchical Algorithms for Assessing Probabilistic Constraints on System Performance," Technical Report, The University of Texas at Austin, Mar. 1997 (also submitted to a conference).