high level synthesis

High Level Synthesis CSE 237D: Spring 2008 Topic #6 Professor Ryan Kastner

Upload: kenyon-anderson

Post on 03-Jan-2016


Page 1: High Level Synthesis

High Level Synthesis

CSE 237D: Spring 2008 Topic #6

Professor Ryan Kastner

Page 2: High Level Synthesis

Ant System Optimization: Overview

Ants work cooperatively on the graph

Each creates a feasible solution

Ants leave pheromones on their trails

Ants make decisions based partially on the amount of pheromone

Global optimizations

Evaporation: pheromones dissipate over time

Reinforcement: update pheromones from good solutions

Quickly converges to good solutions

Page 3: High Level Synthesis

Solving Design Problems using AS

Problem model: define the solution space and create decision variables

Pheromone model (global heuristic): provides a history of search space traversal

Ant search strategy (local heuristic): deterministic strategy for individual ant decision making

Solution construction: probabilistically derive a solution from the local and global heuristics

Feedback: evaluate solution quality, reinforce good solutions (pheromones), slightly evaporate all decisions (weakens poor solutions)

Page 4: High Level Synthesis

Autocatalytic Effect

Page 5: High Level Synthesis

Max-Min Ant System (MMAS) Scheduling

Problem: some pheromones can overpower others, leading to local minima (premature convergence)

Solution: bound the strength of the pheromones, τmin ≤ τ ≤ τmax

If τmin > 0, there is always a chance to make any decision

If τmin = τmax, the decision is based solely on local heuristics, i.e. no past information is taken into account

Page 6: High Level Synthesis

MMAS RCS Formulation

Idea: combine ACO and list scheduling

Ants determine the priority list

List scheduling framework evaluates the "goodness" of the list

Global heuristic: permutation index

Local heuristic: can use different properties

Instruction mobility (IM)

Instruction depth (ID)

Latency weighted instruction depth (LWID)

Successor number (SN)

Page 7: High Level Synthesis

RCS: List Scheduling

A simple scheduling algorithm based on greedy strategies

List scheduling algorithm:

1. Construct a priority list based on some metrics (operation mobility, number of successors, etc.)

2. While not all operations are scheduled:

2.1. For each available resource, select an operation in the ready list following the descending priority

2.2. Assign these operations to the current clock cycle

2.3. Update the ready list

2.4. Clock cycle ++

Quality depends on benchmarks and particular metrics
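The list scheduling steps above can be sketched in Python. This is a minimal illustration, not the course's implementation: it assumes unit-latency operations and a single pool of `num_units` identical resources, and the names `list_schedule`, `preds`, and `priority` are hypothetical.

```python
from typing import Dict, List


def list_schedule(preds: Dict[str, List[str]],
                  priority: List[str],
                  num_units: int) -> Dict[str, int]:
    """Resource-constrained list scheduling (assumes unit latency, a DAG,
    and num_units identical functional units).

    preds maps each operation to its predecessors; priority lists the
    operations from highest to lowest priority. Returns the clock cycle
    (starting at 1) assigned to each operation.
    """
    rank = {op: i for i, op in enumerate(priority)}
    start: Dict[str, int] = {}
    cycle = 1
    while len(start) < len(preds):
        # Ready list: unscheduled operations whose predecessors all
        # started in an earlier cycle.
        ready = [op for op in preds
                 if op not in start
                 and all(p in start and start[p] < cycle for p in preds[op])]
        ready.sort(key=rank.__getitem__)
        # Fill the available resources in descending priority order.
        for op in ready[:num_units]:
            start[op] = cycle
        cycle += 1
    return start


def latency(schedule: Dict[str, int]) -> int:
    """Schedule length in cycles under the unit-latency assumption."""
    return max(schedule.values())
```

With the same graph and resources, two different priority lists can give different latencies, which is the degree of freedom the ants explore.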

Page 8: High Level Synthesis

MMAS RCS: Global and Local Heuristics

Global heuristic: pheromones τij, the favorableness of selecting operation i to position j

Global pheromone matrix

Local heuristic: local metrics η, e.g. instruction mobility, number of successors, etc.

Local decision making: a probabilistic decision

Evaporate pheromone and reinforce good solutions

Page 9: High Level Synthesis

Pheromone Model For Instruction Scheduling

[Figure: pheromone trails τij connecting instructions op1 to op6 with priority list positions 1 to 6]

Each instruction opi ∈ I is associated with n pheromone trails τij, where j = 1, …, n; each indicates the favorableness of assigning instruction i to position j

Each instruction also has a dynamic local heuristic η

Page 10: High Level Synthesis

Ant Search Strategy

Each run has multiple iterations

Each iteration, multiple ants independently create their own priority list

Fill one instruction at a time

[Figure: instructions op1 to op6 mapped to priority list positions 1 to 6; example completed list: op5, op4, op1, op6, op2, op3]

Page 11: High Level Synthesis

Ant Search Strategy

Each ant has memory about instructions already selected

At step j ant has already selected j-1 instructions

jth instruction selected probabilistically

[Figure: instructions op1 to op6 and priority list positions 1 to 6; partial list so far: op5, op4, op1]

Page 12: High Level Synthesis

Ant Search Strategy

τij(k): global heuristic (pheromone) for selecting instruction i at position j

ηj(k): local heuristic, which can use different properties: instruction mobility (IM), instruction depth (ID), latency weighted instruction depth (LWID), successor number (SN)

α, β: control the influence of the global and local heuristics
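The selection formula itself did not survive in this transcript. The sketch below therefore assumes the standard ACO random-proportional rule, in which the weight of a candidate is pheromone^α times local heuristic^β, normalized over the instructions not yet selected. The function names and the dictionary layout of `tau` and `eta` are hypothetical.

```python
import random
from typing import Dict, List, Optional


def selection_probabilities(candidates: List[str], position: int,
                            tau: Dict[str, List[float]],
                            eta: Dict[str, float],
                            alpha: float = 1.0,
                            beta: float = 1.0) -> Dict[str, float]:
    """Probability of placing each candidate at `position`.

    Assumed rule: weight = tau[i][position]**alpha * eta[i]**beta,
    normalized over the candidates (the unselected instructions).
    """
    weights = {i: (tau[i][position] ** alpha) * (eta[i] ** beta)
               for i in candidates}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def build_priority_list(ops: List[str],
                        tau: Dict[str, List[float]],
                        eta: Dict[str, float],
                        alpha: float = 1.0,
                        beta: float = 1.0,
                        rng: Optional[random.Random] = None) -> List[str]:
    """One ant fills the priority list one position at a time."""
    rng = rng or random.Random()
    remaining = list(ops)  # the ant's memory of what is still unselected
    order: List[str] = []
    for position in range(len(ops)):
        probs = selection_probabilities(remaining, position, tau, eta,
                                        alpha, beta)
        chosen = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        order.append(chosen)
        remaining.remove(chosen)
    return order
```

Setting β = 0 in this sketch ignores the local heuristic, and α = 0 ignores the pheromones, which is one way to see what the two exponents control.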

Page 13: High Level Synthesis

Pheromone Update

Lists constructed are evaluated with list scheduling

Latency Lh for the result from ant h

Evaporation: prevent stigmergy and punish "useless" trails

Reinforcement: award trails with better quality
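A minimal sketch of the update described above, under stated assumptions: every trail is first multiplied by a persistence factor (evaporation), and then each trail used by ant h is rewarded by `q / Lh`, so lower latency earns a larger reward. The exact update formula is not in this transcript; `persistence`, `q`, and the function name are hypothetical.

```python
from typing import Dict, List


def update_pheromones(tau: Dict[str, List[float]],
                      ant_lists: List[List[str]],
                      latencies: List[float],
                      persistence: float = 0.98,
                      q: float = 1.0) -> Dict[str, List[float]]:
    """Evaporate all trails, then reinforce the trails each ant used.

    ant_lists[h] is the priority list built by ant h and latencies[h]
    is the list scheduling latency Lh obtained from it.
    """
    # Evaporation: applies to every trail, used or not.
    for op in tau:
        tau[op] = [persistence * t for t in tau[op]]
    # Reinforcement: reward (instruction, position) pairs that were used,
    # in inverse proportion to the resulting latency (assumed reward shape).
    for order, lat in zip(ant_lists, latencies):
        for position, op in enumerate(order):
            tau[op][position] += q / lat
    return tau
```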

Page 14: High Level Synthesis

Pheromone Update

Evaporation happens on all trails to avoid stigmergy

Reward the used trails based on the solution's quality

[Figure: instructions op1 to op6 and priority list positions 1 to 6; trails used by the list op5, op4, op1, op6, op2, op3 are rewarded]

Page 15: High Level Synthesis

Max-Min Ant System (MMAS)

Risks of Ant System optimization

Positive feedback

Dynamic range of pheromone trails can increase rapidly

Unused trails can be repeatedly punished, which reduces their likelihood even more

Premature convergence

MMAS is designed to address this problem

Built upon the original AS

Idea is to limit the pheromone trails within an evolving bound so that broader exploration is possible

Better balance between exploration and exploitation

Prevent premature convergence

Page 16: High Level Synthesis

Max-Min Ant System (MMAS)

Limit τ(t) within τmin(t) and τmax(t)

Sgb is the best global solution found so far at t-1

f(.) is the quality evaluation function, i.e. latency in our case

avg is the average size of decision choices

Pbest ∈ (0,1] is the controlling parameter

Conditional probability of Sgb being selected when all trails in Sgb have τmax and all others have τmin

Smaller Pbest → tighter range for τ → more emphasis on exploration

When Pbest → 0, we set τmin → τmax
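The bound formulas are missing from this transcript. The sketch below assumes the standard Max-Min Ant System bounds from the general MMAS literature, τmax = 1 / ((1 - persistence) · f(Sgb)) and τmin = τmax · (1 - Pbest^(1/n)) / ((avg - 1) · Pbest^(1/n)), where n is the number of decisions in a solution. Whether the slide used exactly this form is an assumption, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def mmas_bounds(best_quality: float, persistence: float,
                p_best: float, n: int, avg: float) -> Tuple[float, float]:
    """Return (tau_min, tau_max) under the assumed standard MMAS formulas.

    best_quality is f(Sgb) (latency here), p_best is in (0, 1],
    n is the number of decisions per solution, and avg > 1 is the
    average number of choices per decision.
    """
    tau_max = 1.0 / (1.0 - persistence) / best_quality
    root = p_best ** (1.0 / n)
    tau_min = tau_max * (1.0 - root) / ((avg - 1.0) * root)
    # As p_best approaches 0 the formula exceeds tau_max, so cap it:
    # this matches "when Pbest -> 0, we set tau_min -> tau_max".
    return min(tau_min, tau_max), tau_max


def clamp_trails(tau: Dict[str, List[float]],
                 tau_min: float, tau_max: float) -> Dict[str, List[float]]:
    """Force every trail into [tau_min, tau_max]."""
    return {op: [min(tau_max, max(tau_min, t)) for t in row]
            for op, row in tau.items()}
```

In this sketch, Pbest = 1 gives τmin = 0 (no lower bound, as in plain AS), while a very small Pbest collapses the range so that only the local heuristic differentiates the choices.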

Page 17: High Level Synthesis

Other Algorithmic Refinements

Dynamically evolving local heuristics

Example: dynamically adjust instruction mobility

Benefit: reduce the search space progressively

Taking advantage of topological sorting of the DFG when constructing the priority list

Each step, ants select from the ready instructions instead of from all unscheduled instructions

Benefit: greatly reduce the search space

Page 18: High Level Synthesis

MMAS RCS Algorithm

Page 19: High Level Synthesis

RCS Results: Pheromones (ARF)

Page 20: High Level Synthesis

Benchmarks: ExpressDFG

A comprehensive benchmark for TCS/RCS

Classic samples and more modern cases

Comprehensive coverage: problem sizes, complexities, applications

Downloadable from http://express.ece.ucsb.edu/benchmark/

Page 21: High Level Synthesis

Auto Regressive Filter

Page 22: High Level Synthesis

Cosine Transform

Page 23: High Level Synthesis

Matrix Inversion

Page 24: High Level Synthesis

RCS Experimental Results

Heterogeneous RCS: multiple types of resources (e.g. fast and normal multipliers)

ILP (optimal) using CPLEX

List scheduling using instruction mobility (IM), instruction depth (ID), latency weighted instruction depth (LWID), successor number (SN)

Ant scheduling results using different local heuristics (averaged over 5 runs, each run 100 iterations with 5 ants)

Table columns: CPLEX is latency/runtime; FDS is force-directed scheduling; List-* is list scheduling with the named metric; MMAS-* is MMAS-IS with the named local heuristic, averaged over 5 runs. A dash marks a value not given on the slide.

Benchmark (nodes/edges)  Resources            CPLEX     FDS  List-IM  List-ID  List-LWID  List-SN  MMAS-IM  MMAS-ID  MMAS-LWID  MMAS-SN
HAL (21/25)              1a, 1fm, 1m, 3i, 3o  8/32      8    8        8        9          8        8        8        8          8
ARF (28/30)              2a, 1fm, 2m          11/22     11   11       13       13         13       11       11       11         11
EWF (34/47)              1a, 1fm, 1m          27/24000  28   28       31       31         28       27.2     27.2     27         27.2
FIR1 (40/39)             2a, 2m, 3i, 3o       13/232    19   19       19       19         18       17.2     17.2     17         17.8
FIR2 (44/43)             1a, 1fm, 1m, 3i, 3o  14/11560  19   19       21       21         21       16.2     16.4     16.2       17
COSINE1 (66/76)          2a, 2m, 1fm, 3i, 3o  -         18   19       20       18         18       17.4     18.2     17.6       17.6
COSINE2 (82/91)          2a, 2m, 1fm, 3i, 3o  -         23   23       23       23         23       21.2     21.2     21.2       21.2
Average                                       -         18   18.2     19.3     20.5       18.5     16.8     17.0     16.9       17.1

Page 25: High Level Synthesis

RCS Experimental Results

Homogeneous RCS: all resources have unit delay

New benchmarks (compared to the last slide) are too large for ILP

Page 26: High Level Synthesis

MMAS RCS: Results

Consistently generates better results over all testing cases

Up to 23.8% better than list scheduler

Average 6.4%, and up to 15% better than force-directed scheduling

Quantitatively closer to known optimal solutions

Page 27: High Level Synthesis

MMAS TCS Formulation

Idea: combine ACO and Force Directed Scheduling

Quick FDS review

Uniformly distribute the operations onto the available resources

Operation probability

Distribution graph (DG)

Self force: changes on DG of scheduling an operation

Predecessor/successor force: implicit effects on DG

Schedule an operation to a step with the minimum force
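The FDS quantities reviewed above can be illustrated with a short sketch. It assumes the classic force-directed formulation: each operation is uniformly likely within its [ASAP, ALAP] time frame, the distribution graph sums those probabilities per resource type and time step, and the self force of fixing an operation at a step is the DG-weighted change in its probabilities. Predecessor/successor forces are left out, and all names are hypothetical.

```python
from typing import Dict, List


def operation_probability(asap: Dict[str, int],
                          alap: Dict[str, int]) -> Dict[str, Dict[int, float]]:
    """Uniform probability of each operation over its time frame."""
    prob: Dict[str, Dict[int, float]] = {}
    for op in asap:
        width = alap[op] - asap[op] + 1
        prob[op] = {t: 1.0 / width for t in range(asap[op], alap[op] + 1)}
    return prob


def distribution_graph(prob: Dict[str, Dict[int, float]],
                       op_type: Dict[str, str],
                       num_steps: int) -> Dict[str, List[float]]:
    """Per resource type, sum of operation probabilities at each step.

    Rows are indexed by time step; index 0 is unused so steps start at 1.
    """
    dg: Dict[str, List[float]] = {}
    for op, frame in prob.items():
        row = dg.setdefault(op_type[op], [0.0] * (num_steps + 1))
        for t, p in frame.items():
            row[t] += p
    return dg


def self_force(op: str, step: int,
               prob: Dict[str, Dict[int, float]],
               dg_row: List[float]) -> float:
    """Sum over the frame of DG(t) * (new probability - old probability)."""
    return sum(dg_row[t] * ((1.0 if t == step else 0.0) - p)
               for t, p in prob[op].items())
```

For two multiplications where one is fixed at step 1 and the other may go in step 1 or 2, the self force is lower at step 2, so FDS would place the flexible one there and balance the multiplier usage.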

Page 28: High Level Synthesis

ACO Formulation for TCS

Initialize pheromone model

While (termination not satisfied): create ants; each ant finds a solution; evaluate solutions and update pheromone

Report the best result found

[Figure: data flow graph between start S and end E (operations +, <, -; nodes v1 to v11 and vn) laid over time steps 1 to 4, with example pheromone trails τ72 and τ73]

Trails τij indicate the favorableness of assigning instruction i to position j

Page 29: High Level Synthesis


ACO Formulation for TCS

Initialize pheromone model

While (termination not satisfied): create ants; each ant finds a solution; evaluate solutions and update pheromone

Report the best result found

Select operation oph probabilistically

Select its time step as follows:

Global heuristic: tied to the search experience

Local heuristic: use the inverse of the distribution graph, 1/qk(j)

Here α and β are constants
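The time step selection formula is not in this transcript. As a hedged sketch, the code below assumes the same random-proportional form used elsewhere in ACO: the weight of step j is τ^α times (1/q(j))^β, normalized over the operation's time frame, where q(j) is the distribution graph value of the operation's resource type at step j. The names and data layout are hypothetical.

```python
import random
from typing import Dict, List, Optional


def timestep_probabilities(op: str, frame: List[int],
                           tau: Dict[str, Dict[int, float]],
                           dg_row: List[float],
                           alpha: float = 1.0,
                           beta: float = 1.0) -> Dict[int, float]:
    """Probability of scheduling `op` at each step j of its time frame.

    Assumed weight: tau[op][j]**alpha * (1 / dg_row[j])**beta. dg_row[j]
    is positive inside the frame because op itself contributes to it.
    """
    weights = {j: (tau[op][j] ** alpha) * ((1.0 / dg_row[j]) ** beta)
               for j in frame}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def pick_timestep(probs: Dict[int, float],
                  rng: Optional[random.Random] = None) -> int:
    """Draw one time step according to the given probabilities."""
    rng = rng or random.Random()
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

With equal pheromones, the less crowded step gets the higher probability, so the local heuristic steers ants toward balanced resource usage.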

Page 30: High Level Synthesis

ACO Formulation for TCS

Initialize pheromone model

While (termination not satisfied): create ants; each ant finds a solution; evaluate solutions and update pheromone

Report the best result found

Pheromone evaporation

Rewarding good partial solutions based on solution quality

Page 31: High Level Synthesis

Final Version of MMAS-TCS

Page 32: High Level Synthesis

Effectiveness of MMAS-TCS

Page 33: High Level Synthesis

MMAS TCS: Results

MMAS TCS is more stable than FDS, especially when the solution is highly unconstrained

258 out of 263 test cases are equal to or better than FDS results

16.4% fewer resources

Page 34: High Level Synthesis

Design Space Exploration

DSE challenges to the designer

Ever increasing design options

Closely related to NP-hard problems: resource allocation, scheduling

Conflicting objectives (speed, cost, power, …)

Increasing time-to-market pressure

Page 35: High Level Synthesis

Our Focus: Timing/Cost

Timing/cost tradeoffs

Known application

Known resource types

Known operation/resource mapping

Question: find the optimal timing/cost tradeoffs

Most commonly faced problem

Fundamental to other design considerations

Page 36: High Level Synthesis

Common Strategies

Usually done in an ad-hoc way

Experience dependent

Or scanning the design space with Resource Constrained (RCS) or Time Constrained (TCS) scheduling

What's the problem?

RCS and TCS are dual problems

Can we effectively use information from one to guide the other?

Page 37: High Level Synthesis

Design Space Model

Page 38: High Level Synthesis

Key Observations

A feasible configuration C covers a beam starting from (tmin, C)

tmin is the RCS result for C

Page 39: High Level Synthesis

Design Space Model

Page 40: High Level Synthesis

Key Observations

A feasible configuration C covers a beam starting from (tmin, C)

Optimal tradeoff curve L is monotonically non-increasing as deadline increases

Page 41: High Level Synthesis

Design Space Model

Page 42: High Level Synthesis

Theorem

If C is the optimal TCS result at time t1, then the RCS result t2 of C satisfies t2 <= t1.

More importantly, no configuration C′ with a smaller cost can produce an execution time within [t2, t1].

Page 43: High Level Synthesis

Theorem (continued)

Page 44: High Level Synthesis

What does it give us?

It implies that we can construct L:

Start from the rightmost t

Find the TCS solution C

Push it leftwards using the RCS solution of C

Do this iteratively (switch between TCS and RCS)
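The iteration above can be sketched as a loop that alternates between the two schedulers. This is only an illustration: `tcs` and `rcs` are hypothetical callbacks standing in for any time-constrained and resource-constrained scheduler, and deadlines are assumed to be positive integers.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

Config = TypeVar("Config")


def explore_tradeoffs(
        t_max: int,
        tcs: Callable[[int], Optional[Tuple[Config, float]]],
        rcs: Callable[[Config], int]) -> List[Tuple[int, int, float]]:
    """Build the tradeoff curve from the rightmost deadline leftwards.

    tcs(deadline) returns (configuration, cost), or None if infeasible.
    rcs(configuration) returns the best latency for that configuration.
    Each curve entry (t_lo, t_hi, cost) says that `cost` covers every
    deadline in [t_lo, t_hi].
    """
    curve: List[Tuple[int, int, float]] = []
    deadline = t_max
    while deadline >= 1:
        result = tcs(deadline)
        if result is None:
            break
        config, cost = result
        # The theorem gives rcs(config) <= deadline; min() only guards
        # against a heuristic RCS returning something worse.
        t_lo = min(rcs(config), deadline)
        curve.append((t_lo, deadline, cost))
        # Skip every deadline already covered by this configuration.
        deadline = t_lo - 1
    return curve
```

Compared with scanning every deadline with TCS, the loop only calls the schedulers once per step of the curve, since whole deadline ranges are covered by a single configuration.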

Page 45: High Level Synthesis

DSE Using Time/Resource Duality

Page 46: High Level Synthesis

Experiments

Three DSE approaches

FDS: exhaustively scanning for TCS

MMAS-TCS: exhaustively scanning for TCS

MMAS-D: proposed method leveraging duality

* Scanning means that we perform TCS on each deadline of interest

Page 47: High Level Synthesis

DSE: MMAS-D vs. FDS

Page 48: High Level Synthesis

Experimental Results

Page 49: High Level Synthesis

Algorithm Runtime

Page 50: High Level Synthesis

Real Design Complications

Heterogeneous mapping

One operation has many implementations

Different bit-width, e.g. a 32-bit multiplier is good for mul(24) and mul(32)

Different area and delay

Real technology library is extremely sophisticated

Hard to estimate final timing and total area

Sharing depends on the cost of multiplexers

Downstream tools may not generate what we expect: resource sharing, register sharing

Downstream tools break components' boundaries: logic synthesis, placement and routing

Page 51: High Level Synthesis

Resource Allocation and Scheduling

Scheduling cost function?

Homogeneous TCS: total number of components

Heterogeneous TCS: total area of functional units

FPGA designs: LUTs, slices, BRAMs, …

ASIC designs: silicon area

Total area comes from: functional units, registers, multiplexers, interconnect

Page 52: High Level Synthesis

Towards Real World: Constraint Graph

A hierarchical directed graph

Nodes V: operations

Edges E(vi, vj, Tij): timing constraints

Timing constraint Ti,j(c, o)

Start time dependencies

Finish time dependencies

Chained dependencies

Page 53: High Level Synthesis

Constraint Graph: Examples

Operations a and b scheduled at the same cycle

Operation b scheduled exactly one cycle after the start of Operation a

Operation b must start after Operation a

Operation a starts at least two cycles after the start of Operation b

Page 54: High Level Synthesis

Pipelined Designs

Start a new task before the prior one completes

Feedback constraints among nodes

Specific initiation interval

Improves throughput

Requires more hardware

Page 55: High Level Synthesis

Operation Chaining

Two or more operations scheduled in the same clock cycle

Faster/larger component

Shorter latency

Saving registers

Chaining across clock edges

Page 56: High Level Synthesis

Speculative Execution

Page 57: High Level Synthesis

Problem Formulation

Constraint graph

Nodes V: operations

Edges E: data dependencies and timing constraints

Technology library Q: area, timing

Resource constraints

Desired clock period: C

Determine the start time and the allocation of each resource type

Resource constraint scheduling

Timing constraint scheduling

Page 58: High Level Synthesis

MMAS CRAAS: Overview

Start with an initial result

Using the fastest components

ASAP/ALAP

Resolving resource conflicts

Meets timing and resource constraints

MMAS iteratively searches for optimal solutions

Page 59: High Level Synthesis

MMAS CRAAS: ASAP/ALAP

Iteratively ASAP/ALAP

Handle loops/feedbacks in the constraint graph

Check ill-posed timing constraints
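One common way to compute ASAP start times on a constraint graph with feedback edges is iterative relaxation, sketched below. This is an assumption about the approach, not the slide's algorithm: each edge `(u, v, w)` means start[v] >= start[u] + w, a maximum-timing constraint start[v] <= start[u] + w is encoded as the backward edge `(v, u, -w)`, and constraints are reported as ill-posed when the relaxation does not settle (a positive-weight cycle).

```python
from typing import Dict, List, Tuple


def asap_with_constraints(nodes: List[str],
                          edges: List[Tuple[str, str, int]]) -> Dict[str, int]:
    """ASAP start times by repeated relaxation (longest-path style).

    Raises ValueError if the constraints cannot all be satisfied.
    """
    start = {v: 0 for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, w in edges:
            # Push v later if it violates start[v] >= start[u] + w.
            if start[u] + w > start[v]:
                start[v] = start[u] + w
                changed = True
        if not changed:
            return start
    # Still changing after |V| passes: a positive cycle exists.
    raise ValueError("ill-posed timing constraints (positive cycle)")
```

For example, the constraint "b exactly one cycle after a" from the constraint graph examples can be written in this encoding as the pair `("a", "b", 1)` and `("b", "a", -1)`.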

Page 60: High Level Synthesis

MMAS CRAAS: Initial Schedule

Resource conflicts: more than the available resources are used in the ASAP results

Pushing operations forward

Page 61: High Level Synthesis

MMAS CRAAS: Overview

Individual ant constructs schedules

Load ASAP timing results

Update mobility range, operation probability

Update distribution graph

Probabilistically defer operations

Probabilistically select operations

Schedule operations using p(i,j,k)

Update ASAP/ALAP results

Page 62: High Level Synthesis

MMAS CRAAS: Global Heuristics

Local heuristics

Favor smaller functional units and fewer registers for this operation

Uniform probability among all compatible resources

Global heuristics

Favor solutions with smaller area

Page 63: High Level Synthesis

MMAS CRAAS: Scheduling

Defer operations from this iteration Favor operations with many options

Schedule an operation

Update ASAP schedules Update global heuristics

Page 64: High Level Synthesis

MMAS CRAAS: Results

Implemented in a leading high-level synthesis framework

Leverages the HDL back-ends to collect results

Front-end parses C and performs optimizations

Resource sharing and register sharing after scheduling

The existing algorithm

Based on FDS/FDLS

Refined for real designs

Force-directed operation deferring

Re-allocates resources and iterates until area increases

Results overview

3-15% smaller (optimizing area)

1-4% faster (optimizing latency)

Page 65: High Level Synthesis

MMAS CRAAS: Results

Page 66: High Level Synthesis

MMAS CRAAS: Results

Hard to generate good results with control-dominant designs (158, 160, and 54)

Better resource allocation and sharing Existing algorithm prematurely converges

Consistent with previous observations

Page 67: High Level Synthesis

Conclusions and Future Research

There is (was?) room for more work in fundamental algorithms; they make a difference on real designs

Ivory Tower: most academics do not tackle real world problems

Constraint graph with pipelining, speculation, chaining

Actual delay and area (mux, interconnect, …)

Gripes: extremely hard to validate new algorithms against old ones (e.g. no open source code for FDS!)

Backend (hooks into commercial tools a la Quartus)

Benchmarks?!