High Level Synthesis
CSE 237D: Spring 2008 Topic #6
Professor Ryan Kastner
Ant System Optimization: Overview
Ants work cooperatively on the graph; each creates a feasible solution
Ants leave pheromones on their trails
Ants make decisions based partially on the amount of pheromones
Global optimizations
Evaporation: pheromones dissipate over time
Reinforcement: pheromones updated from good solutions
Quickly converges to good solutions
Solving Design Problems using AS
Problem model: define the solution space; create decision variables
Pheromone model: global heuristic, provides the history of search space traversal
Ant search strategy: local heuristic, a deterministic strategy for individual ant decision making
Solution construction: probabilistically derive solutions from the local and global heuristics
Feedback: evaluate solution quality, reinforce good solutions (pheromones), slightly evaporate all trails (weakens poor solutions)
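The framework above (pheromone model, ant search, solution construction, feedback) can be sketched end-to-end. This is an illustrative toy, not code from the slides: the cost function, parameter values, and the use of permutations as solutions are assumptions for the example. Ants build permutations (priority lists), and the feedback step evaporates and reinforces a pheromone matrix tau[i][j].

```python
import random

def ant_system(num_items, cost_fn, iters=50, num_ants=5, rho=0.1, seed=0):
    """Toy Ant System: ants build permutations (e.g. priority lists).
    tau[i][j] = favorableness of placing item i at position j."""
    rng = random.Random(seed)
    tau = [[1.0] * num_items for _ in range(num_items)]   # pheromone model
    best, best_cost = None, float("inf")
    for _ in range(iters):
        solutions = []
        for _ in range(num_ants):
            remaining = list(range(num_items))
            perm = []
            for j in range(num_items):
                # solution construction: probabilistic choice from pheromones
                weights = [tau[i][j] for i in remaining]
                pick = rng.choices(range(len(remaining)), weights)[0]
                perm.append(remaining.pop(pick))
            solutions.append((cost_fn(perm), perm))
        # feedback: evaporate every trail, then reinforce the used trails
        for row in tau:
            for j in range(num_items):
                row[j] *= (1.0 - rho)
        for cost, perm in solutions:
            for j, i in enumerate(perm):
                tau[i][j] += 1.0 / (1.0 + cost)
            if cost < best_cost:
                best_cost, best = cost, perm
    return best, best_cost
```

A local heuristic term is deliberately omitted here to keep the skeleton small; the MMAS RCS slides below add it back.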
Autocatalytic Effect
Max-Min Ant System (MMAS) Scheduling
Problem: some pheromones can overpower others, leading to local minima (premature convergence)
Solution: bound the strength of the pheromones within [τmin, τmax]
If τmin > 0, there is always a chance to make any decision
If τmin = τmax, the decision is based solely on local heuristics, i.e. no past information is taken into account
MMAS RCS Formulation
Idea: combine ACO and list scheduling
Ants determine the priority list
The list scheduling framework evaluates the “goodness” of the list
Global heuristic: pheromones on the (instruction, position) permutation index
Local heuristic: can use different properties
Instruction mobility (IM), instruction depth (ID), latency weighted instruction depth (LWID), successor number (SN)
RCS: List Scheduling
A simple scheduling algorithm based on greedy strategies
List scheduling algorithm:
1. Construct a priority list based on some metric (operation mobility, number of successors, etc.)
2. While not all operations are scheduled:
   a. For each available resource, select an operation from the ready list in descending priority
   b. Assign these operations to the current clock cycle
   c. Update the ready list
   d. Clock cycle++
Result quality depends on the benchmark and the particular metric
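Steps 1–2 above can be sketched as follows. This is a minimal illustration assuming a single resource type and unit-latency operations; the DFG, priority values, and names in the usage below are invented for the example.

```python
def list_schedule(preds, priority, num_units):
    """Resource-constrained list scheduling, one resource type,
    unit-latency operations.
    preds[v]: set of data predecessors of operation v
    priority[v]: static priority (e.g. mobility, successor count)
    num_units: resources available per clock cycle."""
    schedule = {}                      # operation -> start cycle
    cycle = 0
    while len(schedule) < len(preds):
        # ready list: unscheduled ops whose predecessors have finished
        ready = [v for v in preds if v not in schedule and
                 all(p in schedule and schedule[p] < cycle
                     for p in preds[v])]
        ready.sort(key=lambda v: priority[v], reverse=True)
        for v in ready[:num_units]:    # fill this cycle's resources
            schedule[v] = cycle
        cycle += 1                     # advance the clock
    return schedule
```

For a diamond DFG a,b → c → d with two units, a and b share cycle 0, c runs at cycle 1, d at cycle 2; with one unit the same list stretches to four cycles.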
MMAS RCS: Global and Local Heuristics
Global heuristic: pheromones τij encode the favorableness of selecting operation i for position j, kept in a global pheromone matrix
Local heuristic: local metrics ηj such as instruction mobility, number of successors, etc.
Local decision making: a probabilistic decision
Evaporate pheromones and reinforce good solutions
Pheromone Model For Instruction Scheduling
[Figure: pheromone matrix τij linking instructions op1–op6 to priority-list positions 1–6]
Each instruction opi ∈ I is associated with n pheromone trails τij, j = 1, …, n; each indicates the favorableness of assigning instruction i to position j
Each instruction also has a dynamic local heuristic ηij
Ant Search Strategy
[Figure: ants fill the priority list one slot at a time, e.g. op5, op4, op1, op6, op2, op3]
Each run has multiple iterations
Each iteration, multiple ants independently create their own priority lists
Fill one instruction at a time
Ant Search Strategy
Each ant has memory of the instructions already selected
At step j, an ant has already selected j−1 instructions
The jth instruction is selected probabilistically
Ant Search Strategy
τij(k): global heuristic (pheromone) for selecting instruction i at position j
ηj(k): local heuristic; can use different properties: instruction mobility (IM), instruction depth (ID), latency weighted instruction depth (LWID), successor number (SN)
Selection probability: p(i, j) = τij(k)^α · ηj(k)^β / Σl τlj(k)^α · ηl(k)^β
α, β control the influence of the global and local heuristics
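The probabilistic decision rule can be sketched as below. The `tau_j` and `eta` dictionaries stand in for one column of the pheromone matrix and the local-heuristic values; these names and the default α, β are assumptions for illustration.

```python
import random

def select_instruction(candidates, tau_j, eta, alpha=1.0, beta=1.0,
                       rng=random):
    """Pick the instruction for one priority-list position with
    p(i) proportional to tau_j[i]**alpha * eta[i]**beta
    (global pheromone x local heuristic)."""
    weights = [tau_j[i] ** alpha * eta[i] ** beta for i in candidates]
    return rng.choices(candidates, weights=weights)[0]
```

Raising β relative to α shifts the ant toward the deterministic local heuristic; α dominating makes past search experience decide.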
Pheromone Update
Constructed lists are evaluated with list scheduling, giving a latency Lh for the result from ant h
Evaporation: keep stigmergy in check and punish “useless” trails
Reinforcement: reward trails that produced better-quality solutions
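A sketch of the update step, assuming the common rule that ant h's reward is proportional to 1/Lh (the slide does not show the exact reward function, so that choice and the Q scaling constant are assumptions):

```python
def update_pheromones(tau, ant_lists, latencies, rho=0.1, Q=1.0):
    """Evaporate every trail, then reinforce the (instruction, position)
    trails each ant actually used, scaled by 1/L_h so shorter-latency
    lists deposit more pheromone.
    tau: dict of dicts, tau[i][j] for instruction i at position j."""
    for i in tau:                          # evaporation on all trails
        for j in tau[i]:
            tau[i][j] *= (1.0 - rho)
    for perm, lat in zip(ant_lists, latencies):
        for j, i in enumerate(perm):       # reward only the used trails
            tau[i][j] += Q / lat
    return tau
```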
Pheromone Update
[Figure: evaporation applied to all trails; the trails used by the list op5, op4, op1, op6, op2, op3 are rewarded]
Evaporation happens on all trails to keep stigmergy in check
Reward the used trails based on the solution’s quality
Max-Min Ant System (MMAS)
Risks of Ant System optimization
Positive feedback: the dynamic range of pheromone trails can increase rapidly
Unused trails can be repeatedly punished, which reduces their likelihood even more
Premature convergence
MMAS is designed to address this problem
Built upon the original AS; the idea is to limit the pheromone trails within an evolving bound so that broader exploration is possible
Better balances exploration and exploitation; prevents premature convergence
Max-Min Ant System (MMAS)
Limit τ(t) within τmin(t) and τmax(t):
  τmax(t) = 1 / (ρ · f(Sgb))
  τmin(t) = τmax(t) · (1 − Pbest^(1/n)) / ((avg − 1) · Pbest^(1/n))
Sgb is the best global solution found up to t−1; f(·) is the quality evaluation function, i.e. latency in our case; avg is the average size of the decision choices; n is the number of decisions in a solution
Pbest ∈ (0,1] is the controlling parameter: the conditional probability of Sgb being selected when all trails in Sgb have τmax and all others have τmin
Smaller Pbest gives a tighter range, for more emphasis on exploration
When Pbest → 0, we set τmin = τmax
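The bound computation above can be written as a small helper. This is a sketch of the standard MMAS rule (ρ is the evaporation rate, n the solution length); the parameter values in the check are invented for illustration.

```python
def mmas_bounds(best_latency, rho, p_best, n, avg):
    """tau_max = 1/(rho * f(S_gb)); tau_min derived from P_best,
    the target probability of reconstructing S_gb when its trails
    are all at tau_max and every other trail is at tau_min."""
    tau_max = 1.0 / (rho * best_latency)
    root = p_best ** (1.0 / n)                 # n-th root of P_best
    tau_min = tau_max * (1.0 - root) / ((avg - 1.0) * root)
    # as P_best -> 0 the formula blows up, so clamp tau_min to tau_max
    return min(tau_min, tau_max), tau_max
```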
Other Algorithmic Refinements
Dynamically evolving local heuristics. Example: dynamically adjust instruction mobility. Benefit: progressively reduces the search space
Take advantage of the topological sorting of the DFG when constructing the priority list: at each step, ants select from the ready instructions instead of from all unscheduled instructions. Benefit: greatly reduces the search space
MMAS RCS Algorithm
RCS Results: Pheromones (ARF)
Benchmarks: ExpressDFG
A comprehensive benchmark suite for TCS/RCS
Classic samples and more modern cases
Comprehensive coverage of problem sizes, complexities, and applications
Downloadable from http://express.ece.ucsb.edu/benchmark/
Auto Regressive Filter
Cosine Transform
Matrix Inversion
RCS Experimental Results
Heterogeneous RCS: multiple types of resources (e.g. fast and normal multipliers)
ILP (optimal) using CPLEX
List scheduling with instruction mobility (IM), instruction depth (ID), latency weighted instruction depth (LWID), and successor number (SN)
Ant scheduling results using different local heuristics (averaged over 5 runs, each run 100 iterations with 5 ants)
Benchmark (nodes/edges) | Resources           | CPLEX (latency/runtime) | FDS | List Scheduling IM/ID/LWID/SN | MMAS-IS (avg of 5 runs) IM/ID/LWID/SN
HAL (21/25)             | 1a, 1fm, 1m, 3i, 3o | 8/32                    | 8   | 8 / 8 / 9 / 8                 | 8 / 8 / 8 / 8
ARF (28/30)             | 2a, 1fm, 2m         | 11/22                   | 11  | 11 / 13 / 13 / 13             | 11 / 11 / 11 / 11
EWF (34/47)             | 1a, 1fm, 1m         | 27/24000                | 28  | 28 / 31 / 31 / 28             | 27.2 / 27.2 / 27 / 27.2
FIR1 (40/39)            | 2a, 2m, 3i, 3o      | 13/232                  | 19  | 19 / 19 / 19 / 18             | 17.2 / 17.2 / 17 / 17.8
FIR2 (44/43)            | 1a, 1fm, 1m, 3i, 3o | 14/11560                | 19  | 19 / 21 / 21 / 21             | 16.2 / 16.4 / 16.2 / 17
COSINE1 (66/76)         | 2a, 2m, 1fm, 3i, 3o | –                       | 18  | 19 / 20 / 18 / 18             | 17.4 / 18.2 / 17.6 / 17.6
COSINE2 (82/91)         | 2a, 2m, 1fm, 3i, 3o | –                       | 23  | 23 / 23 / 23 / 23             | 21.2 / 21.2 / 21.2 / 21.2
Average                 |                     |                         | 18  | 18.2 / 19.3 / 20.5 / 18.5     | 16.8 / 17.0 / 16.9 / 17.1
RCS Experimental Results
Homogeneous RCS: all resources have unit delay. New benchmarks (compared to the last slide) are too large for ILP
MMAS RCS: Results
Consistently generates better results over all test cases
Up to 23.8% better than the list scheduler
On average 6.4%, and up to 15%, better than force-directed scheduling
Quantitatively closer to the known optimal solutions
Idea: combine ACO and force-directed scheduling (FDS). Quick FDS review:
Uniformly distribute the operations onto the available resources
Operation probability: each operation is equally likely at any step within its [ASAP, ALAP] time frame
Distribution graph (DG): the sum of the operation probabilities at each step
Self force: the change in the DG from scheduling an operation; predecessor/successor forces: implicit effects on the DG
Schedule an operation to the step with the minimum force
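The operation probability and distribution graph can be sketched as follows. The `frames` representation, mapping each operation to its inclusive [ASAP, ALAP] range, is an assumption for illustration.

```python
def distribution_graph(frames, num_steps):
    """FDS distribution graph: each operation is uniformly probable
    over its [asap, alap] time frame; DG(j) sums those probabilities
    over all operations for each control step j."""
    dg = [0.0] * num_steps
    for asap, alap in frames.values():
        p = 1.0 / (alap - asap + 1)        # uniform operation probability
        for j in range(asap, alap + 1):
            dg[j] += p
    return dg
```

Two operations with frames [0,1] and [0,0] yield DG = [1.5, 0.5]: step 0 is more crowded, so FDS forces push mobile operations toward step 1.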
MMAS TCS Formulation
ACO Formulation for TCS
Initialize pheromone model While (termination not satisfied)
Create ants Each ant finds a solution Evaluate solutions and update pheromone
Report the best result found
[Figure: example DFG (operations v1–v11, time steps 1–4) with pheromone trails τij from operations to time steps]
Trails τij indicate the favorableness of assigning instruction i to position j
ACO Formulation for TCS
Within the loop above, each ant selects an operation oph probabilistically, then selects its time step as follows:
Global heuristics: tied to the search experience (pheromones τhj)
Local heuristics: the inverse of the distribution graph, 1/qk(j)
  p(h, j) = τhj^α · (1/qk(j))^β / Σl τhl^α · (1/qk(l))^β
Here α and β are constants
ACO Formulation for TCS
After each iteration of the loop:
Pheromone evaporation
Reward good partial solutions based on solution quality
Final Version of MMAS-TCS
Effectiveness of MMAS-TCS
MMAS TCS: Results
MMAS TCS is more stable than FDS, especially when the problem is highly unconstrained
258 out of 263 test cases are equal to or better than the FDS results
16.4% fewer resources
Design Space Exploration
DSE challenges for the designer:
Ever-increasing design options
Closely related to NP-hard problems: resource allocation, scheduling
Conflicting objectives (speed, cost, power, …)
Increasing time-to-market pressure
Our Focus: Timing/Cost
Timing/cost tradeoffs: known application, known resource types, known operation-to-resource mapping
Question: find the optimal timing/cost tradeoff curve
The most commonly faced problem; fundamental to other design considerations
Common Strategies
Usually done in an ad hoc, experience-dependent way
Or by scanning the design space with resource-constrained (RCS) or time-constrained (TCS) scheduling
What’s the problem? RCS and TCS are dual problems. Can we effectively use information from one to guide the other?
Design Space Model
Key Observations
A feasible configuration C covers a beam starting from (tmin, C), where tmin is the RCS result for C
Optimal tradeoff curve L is monotonically non-increasing as deadline increases
Design Space Model
Theorem
If C is the optimal TCS result at time t1, then the RCS result t2 of C satisfies t2 ≤ t1.
More importantly, no configuration C′ with a smaller cost can produce an execution time within [t2, t1].
Theorem (continued)
What does it give us?
It implies that we can construct L:
Start from the rightmost t
Find the TCS solution C
Push it leftwards using the RCS solution of C
Do this iteratively (alternating between TCS and RCS)
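The iteration can be sketched with two solver callbacks. The interfaces (`tcs(deadline)` returning the cheapest feasible configuration, `rcs(config)` returning its minimum latency) and the toy library in the check are assumptions for illustration, not the paper's actual solvers.

```python
def explore_tradeoffs(t_max, t_min, tcs, rcs):
    """Build the tradeoff curve L via TCS/RCS duality: solve TCS at
    deadline t, then jump left to t2 = RCS(C); by the theorem no
    cheaper configuration exists anywhere in [t2, t]."""
    curve = []                   # (leftmost deadline, configuration) pairs
    t = t_max
    while t >= t_min:
        cfg = tcs(t)             # cheapest configuration meeting deadline t
        t2 = rcs(cfg)            # fastest this configuration can run
        curve.append((t2, cfg))  # cfg covers the whole beam [t2, t]
        t = t2 - 1               # resume exploration left of the beam
    return curve
```

Each TCS call can thus skip the whole beam [t2, t] in one step, which is the saving over exhaustive deadline-by-deadline scanning.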
DSE Using Time/Resource Duality
Experiments
Three DSE approaches:
FDS: exhaustively scanning for TCS
MMAS-TCS: exhaustively scanning for TCS
MMAS-D: the proposed method, leveraging duality
* Scanning means that we perform TCS at each deadline of interest
DSE: MMAS-D vs. FDS
Experimental Results
Algorithm Runtime
Real Design Complications
Heterogeneous mapping: one operation has many implementations, with different bit-widths (e.g. a 32-bit multiplier is good for both mul(24) and mul(32)) and different area and delay
Real technology libraries are extremely sophisticated: hard to estimate the final timing and total area
Sharing depends on the cost of multiplexers
Downstream tools may not generate what we expect: resource sharing, register sharing
Downstream tools break component boundaries: logic synthesis, placement, and routing
Resource Allocation and Scheduling
Scheduling cost function?
Homogeneous TCS: total number of components
Heterogeneous TCS: total area of the functional units
FPGA designs: LUTs, slices, BRAMs, …
ASIC designs: silicon area
Total area comes from: functional units, registers, multiplexers, interconnect
Towards Real World: Constraint Graph
A hierarchical directed graph
Nodes V: operations
Edges E(vi, vj, Tij): timing constraints
Timing constraints Ti,j(c,o): start-time dependencies, finish-time dependencies, chained dependencies
Constraint Graph: Examples
Operations a and b scheduled in the same cycle
Operation b scheduled exactly one cycle after the start of operation a
Operation b must start after operation a
Operation a starts at least two cycles after the start of operation b
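All four examples are difference constraints of the form s[b] − s[a] ≥ w, so a constraint graph can be checked and ASAP-scheduled with a Bellman-Ford-style relaxation. This sketch assumes edges are (a, b, w) triples; the encoding in the docstring is one standard way to express the slide's examples, and infeasible (positive-weight) cycles are reported as ill-posed constraints.

```python
def asap_from_constraints(ops, edges):
    """Earliest start times satisfying s[b] - s[a] >= w for every
    edge (a, b, w).  The example constraints map to:
      a, b in the same cycle        -> (a, b, 0) and (b, a, 0)
      b exactly one cycle after a   -> (a, b, 1) and (b, a, -1)
      b must start after a          -> (a, b, 1)
      a at least two cycles after b -> (b, a, 2)
    """
    start = {v: 0 for v in ops}
    for _ in range(len(ops) + 1):          # Bellman-Ford-style passes
        changed = False
        for a, b, w in edges:
            if start[a] + w > start[b]:    # relax: lift s[b] as needed
                start[b] = start[a] + w
                changed = True
        if not changed:
            return start                   # fixed point: all constraints hold
    raise ValueError("ill-posed timing constraints (positive cycle)")
```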
Pipelined Designs
Start a new task before the prior one has completed
Feedback constraints among nodes
Specific initiation interval
Improves throughput; requires more hardware
Operation Chaining
Two or more operations scheduled in the same clock cycle
Faster/larger components, shorter latency, saves registers
Chaining across clock edges
Speculative Execution
Problem Formulation
Constraint graph: nodes V are operations; edges E are data dependencies and timing constraints
Technology library Q: area, timing
Resource constraints
Desired clock period: C
Determine the start time of each operation and the allocation of each resource type
Resource-constrained scheduling; timing-constrained scheduling
MMAS CRAAS: Overview
Start with an initial result: use the fastest components, ASAP/ALAP, and resolve resource conflicts; meet the timing and resource constraints
MMAS then iteratively searches for optimal solutions
MMAS CRAAS: ASAP/ALAP
Iterative ASAP/ALAP
Handles loops/feedback in the constraint graph
Checks for ill-posed timing constraints
MMAS CRAAS: Initial Schedule
Resource conflicts: more resources than available are used in the ASAP results
Push operations forward to resolve them
MMAS CRAAS: Overview
Each individual ant constructs schedules:
Load the ASAP timing results
Update the mobility range and operation probabilities
Update the distribution graph
Probabilistically defer operations
Probabilistically select operations
Schedule operations using p(i,j,k)
Update the ASAP/ALAP results
MMAS CRAAS: Global and Local Heuristics
Local heuristics: favor smaller functional units and fewer registers for this operation; uniform probability among all compatible resources
Global heuristics: favor solutions with smaller area
MMAS CRAAS: Scheduling
Defer operations from this iteration; favor operations with many options
Schedule an operation
Update the ASAP schedules; update the global heuristics
MMAS CRAAS: Results
Implemented in a leading high-level synthesis framework
Leverages the HDL back-ends to collect results; the front-end parses C and performs optimizations; resource sharing and register sharing are done after scheduling
The existing algorithm: based on FDS/FDLS, refined for real designs; force-directed operation deferring; re-allocates resources and iterates until the area increases
Results overview: 3–15% smaller (optimizing area); 1–4% faster (optimizing latency)
MMAS CRAAS: Results
Hard to generate good results for control-dominated designs (158, 160, and 54)
Better resource allocation and sharing; the existing algorithm prematurely converges
Consistent with previous observations
Conclusions and Future Research
There is (was?) room for more work on fundamental algorithms; they make a difference on real designs
Ivory tower: most academics do not tackle real-world problems, e.g. constraint graphs with pipelining, speculation, and chaining; actual delay and area (muxes, interconnect, …)
Gripes: it is extremely hard to validate new algorithms against old ones (e.g. no open-source code for FDS!); backends (hooks into commercial tools à la Quartus); benchmarks?!