High Level Synthesis
CSE 237D: Spring 2008 Topic #6
Professor Ryan Kastner
Ant System Optimization: Overview
Ants work cooperatively on the graph; each creates a feasible solution
Ants leave pheromones on their trails
Ants make decisions based partially on the amount of pheromones
Global optimizations
Evaporation: pheromones dissipate over time
Reinforcement: pheromones updated from good solutions
Quickly converges to good solutions
Solving Design Problems using AS
Problem model: define the solution space; create decision variables
Pheromone model: global heuristic, provides the history of search space traversal
Ant search strategy: local heuristic, a deterministic strategy for individual ant decision making
Solution construction: probabilistically derive solutions from the local and global heuristics
Feedback: evaluate solution quality, reinforce good solutions (pheromones), slightly evaporate all trails (weakens poor solutions)
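The framework above (pheromone model, ant search, solution construction, feedback) can be sketched end-to-end. This is an illustrative toy, not code from the slides: the cost function, parameter values, and the use of permutations as solutions are assumptions for the example. Ants build permutations (priority lists), and the feedback step evaporates and reinforces a pheromone matrix tau[i][j].

```python
import random

def ant_system(num_items, cost_fn, iters=50, num_ants=5, rho=0.1, seed=0):
    """Toy Ant System: ants build permutations (e.g. priority lists).
    tau[i][j] = favorableness of placing item i at position j."""
    rng = random.Random(seed)
    tau = [[1.0] * num_items for _ in range(num_items)]   # pheromone model
    best, best_cost = None, float("inf")
    for _ in range(iters):
        solutions = []
        for _ in range(num_ants):
            remaining = list(range(num_items))
            perm = []
            for j in range(num_items):
                # solution construction: probabilistic choice from pheromones
                weights = [tau[i][j] for i in remaining]
                pick = rng.choices(range(len(remaining)), weights)[0]
                perm.append(remaining.pop(pick))
            solutions.append((cost_fn(perm), perm))
        # feedback: evaporate every trail, then reinforce the used trails
        for row in tau:
            for j in range(num_items):
                row[j] *= (1.0 - rho)
        for cost, perm in solutions:
            for j, i in enumerate(perm):
                tau[i][j] += 1.0 / (1.0 + cost)
            if cost < best_cost:
                best_cost, best = cost, perm
    return best, best_cost
```

A local heuristic term is deliberately omitted here to keep the skeleton small; the MMAS RCS slides below add it back.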
Autocatalytic Effect
Max-Min Ant System (MMAS) Scheduling
Problem: some pheromones can overpower others, leading to local minima (premature convergence)
Solution: bound the strength of the pheromones within [τmin, τmax]
If τmin > 0, there is always a chance to make any decision
If τmin = τmax, the decision is based solely on local heuristics, i.e. no past information is taken into account
MMAS RCS Formulation
Idea: combine ACO and list scheduling
Ants determine the priority list
The list scheduling framework evaluates the “goodness” of the list
Global heuristic: pheromones on the (instruction, position) permutation index
Local heuristic: can use different properties
Instruction mobility (IM), instruction depth (ID), latency weighted instruction depth (LWID), successor number (SN)
RCS: List Scheduling
A simple scheduling algorithm based on greedy strategies
List scheduling algorithm:
1. Construct a priority list based on some metric (operation mobility, number of successors, etc.)
2. While not all operations are scheduled:
   a. For each available resource, select an operation from the ready list in descending priority
   b. Assign these operations to the current clock cycle
   c. Update the ready list
   d. Clock cycle++
Result quality depends on the benchmark and the particular metric
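Steps 1–2 above can be sketched as follows. This is a minimal illustration assuming a single resource type and unit-latency operations; the DFG, priority values, and names in the usage below are invented for the example.

```python
def list_schedule(preds, priority, num_units):
    """Resource-constrained list scheduling, one resource type,
    unit-latency operations.
    preds[v]: set of data predecessors of operation v
    priority[v]: static priority (e.g. mobility, successor count)
    num_units: resources available per clock cycle."""
    schedule = {}                      # operation -> start cycle
    cycle = 0
    while len(schedule) < len(preds):
        # ready list: unscheduled ops whose predecessors have finished
        ready = [v for v in preds if v not in schedule and
                 all(p in schedule and schedule[p] < cycle
                     for p in preds[v])]
        ready.sort(key=lambda v: priority[v], reverse=True)
        for v in ready[:num_units]:    # fill this cycle's resources
            schedule[v] = cycle
        cycle += 1                     # advance the clock
    return schedule
```

For a diamond DFG a,b → c → d with two units, a and b share cycle 0, c runs at cycle 1, d at cycle 2; with one unit the same list stretches to four cycles.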
MMAS RCS: Global and Local Heuristics
Global heuristic: pheromones τij encode the favorableness of selecting operation i for position j, kept in a global pheromone matrix
Local heuristic: local metrics ηj such as instruction mobility, number of successors, etc.
Local decision making: a probabilistic decision
Evaporate pheromones and reinforce good solutions
Pheromone Model For Instruction Scheduling
[Figure: pheromone matrix τij linking instructions op1–op6 to priority-list positions 1–6]
Each instruction opi ∈ I is associated with n pheromone trails τij, j = 1, …, n; each indicates the favorableness of assigning instruction i to position j
Each instruction also has a dynamic local heuristic ηij
Ant Search Strategy
[Figure: ants fill the priority list one slot at a time, e.g. op5, op4, op1, op6, op2, op3]
Each run has multiple iterations
Each iteration, multiple ants independently create their own priority lists
Fill one instruction at a time
Ant Search Strategy
Each ant has memory of the instructions already selected
At step j, an ant has already selected j−1 instructions
The jth instruction is selected probabilistically
Ant Search Strategy
τij(k): global heuristic (pheromone) for selecting instruction i at position j
ηj(k): local heuristic; can use different properties: instruction mobility (IM), instruction depth (ID), latency weighted instruction depth (LWID), successor number (SN)
Selection probability: p(i, j) = τij(k)^α · ηj(k)^β / Σl τlj(k)^α · ηl(k)^β
α, β control the influence of the global and local heuristics
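The probabilistic decision rule can be sketched as below. The `tau_j` and `eta` dictionaries stand in for one column of the pheromone matrix and the local-heuristic values; these names and the default α, β are assumptions for illustration.

```python
import random

def select_instruction(candidates, tau_j, eta, alpha=1.0, beta=1.0,
                       rng=random):
    """Pick the instruction for one priority-list position with
    p(i) proportional to tau_j[i]**alpha * eta[i]**beta
    (global pheromone x local heuristic)."""
    weights = [tau_j[i] ** alpha * eta[i] ** beta for i in candidates]
    return rng.choices(candidates, weights=weights)[0]
```

Raising β relative to α shifts the ant toward the deterministic local heuristic; α dominating makes past search experience decide.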
Pheromone Update
Constructed lists are evaluated with list scheduling, giving a latency Lh for the result from ant h
Evaporation: keep stigmergy in check and punish “useless” trails
Reinforcement: reward trails that produced better-quality solutions
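A sketch of the update step, assuming the common rule that ant h's reward is proportional to 1/Lh (the slide does not show the exact reward function, so that choice and the Q scaling constant are assumptions):

```python
def update_pheromones(tau, ant_lists, latencies, rho=0.1, Q=1.0):
    """Evaporate every trail, then reinforce the (instruction, position)
    trails each ant actually used, scaled by 1/L_h so shorter-latency
    lists deposit more pheromone.
    tau: dict of dicts, tau[i][j] for instruction i at position j."""
    for i in tau:                          # evaporation on all trails
        for j in tau[i]:
            tau[i][j] *= (1.0 - rho)
    for perm, lat in zip(ant_lists, latencies):
        for j, i in enumerate(perm):       # reward only the used trails
            tau[i][j] += Q / lat
    return tau
```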
Pheromone Update
[Figure: evaporation applied to all trails; the trails used by the list op5, op4, op1, op6, op2, op3 are rewarded]
Evaporation happens on all trails to keep stigmergy in check
Reward the used trails based on the solution’s quality
Max-Min Ant System (MMAS)
Risks of Ant System optimization
Positive feedback: the dynamic range of pheromone trails can increase rapidly
Unused trails can be repeatedly punished, which reduces their likelihood even more
Premature convergence
MMAS is designed to address this problem
Built upon the original AS; the idea is to limit the pheromone trails within an evolving bound so that broader exploration is possible
Better balances exploration and exploitation; prevents premature convergence
Max-Min Ant System (MMAS)
Limit τ(t) within τmin(t) and τmax(t):
  τmax(t) = 1 / (ρ · f(Sgb))
  τmin(t) = τmax(t) · (1 − Pbest^(1/n)) / ((avg − 1) · Pbest^(1/n))
Sgb is the best global solution found up to t−1; f(·) is the quality evaluation function, i.e. latency in our case; avg is the average size of the decision choices; n is the number of decisions in a solution
Pbest ∈ (0,1] is the controlling parameter: the conditional probability of Sgb being selected when all trails in Sgb have τmax and all others have τmin
Smaller Pbest gives a tighter range, for more emphasis on exploration
When Pbest → 0, we set τmin = τmax
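The bound computation above can be written as a small helper. This is a sketch of the standard MMAS rule (ρ is the evaporation rate, n the solution length); the parameter values in the check are invented for illustration.

```python
def mmas_bounds(best_latency, rho, p_best, n, avg):
    """tau_max = 1/(rho * f(S_gb)); tau_min derived from P_best,
    the target probability of reconstructing S_gb when its trails
    are all at tau_max and every other trail is at tau_min."""
    tau_max = 1.0 / (rho * best_latency)
    root = p_best ** (1.0 / n)                 # n-th root of P_best
    tau_min = tau_max * (1.0 - root) / ((avg - 1.0) * root)
    # as P_best -> 0 the formula blows up, so clamp tau_min to tau_max
    return min(tau_min, tau_max), tau_max
```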
Other Algorithmic Refinements
Dynamically evolving local heuristics. Example: dynamically adjust instruction mobility. Benefit: progressively reduces the search space
Take advantage of the topological sorting of the DFG when constructing the priority list: at each step, ants select from the ready instructions instead of from all unscheduled instructions. Benefit: greatly reduces the search space
MMAS RCS Algorithm
RCS Results: Pheromones (ARF)
Benchmarks: ExpressDFG
A comprehensive benchmark suite for TCS/RCS
Classic samples and more modern cases
Comprehensive coverage of problem sizes, complexities, and applications
Downloadable from http://express.ece.ucsb.edu/benchmark/
Auto Regressive Filter
Cosine Transform
Matrix Inversion
RCS Experimental Results
Heterogeneous RCS: multiple types of resources (e.g. fast and normal multipliers)
ILP (optimal) using CPLEX
List scheduling with instruction mobility (IM), instruction depth (ID), latency weighted instruction depth (LWID), and successor number (SN)
Ant scheduling results using different local heuristics (averaged over 5 runs, each run 100 iterations with 5 ants)
Benchmark (nodes/edges) | Resources           | CPLEX (latency/runtime) | FDS | List Scheduling IM/ID/LWID/SN | MMAS-IS (avg of 5 runs) IM/ID/LWID/SN
HAL (21/25)             | 1a, 1fm, 1m, 3i, 3o | 8/32                    | 8   | 8 / 8 / 9 / 8                 | 8 / 8 / 8 / 8
ARF (28/30)             | 2a, 1fm, 2m         | 11/22                   | 11  | 11 / 13 / 13 / 13             | 11 / 11 / 11 / 11
EWF (34/47)             | 1a, 1fm, 1m         | 27/24000                | 28  | 28 / 31 / 31 / 28             | 27.2 / 27.2 / 27 / 27.2
FIR1 (40/39)            | 2a, 2m, 3i, 3o      | 13/232                  | 19  | 19 / 19 / 19 / 18             | 17.2 / 17.2 / 17 / 17.8
FIR2 (44/43)            | 1a, 1fm, 1m, 3i, 3o | 14/11560                | 19  | 19 / 21 / 21 / 21             | 16.2 / 16.4 / 16.2 / 17
COSINE1 (66/76)         | 2a, 2m, 1fm, 3i, 3o | –                       | 18  | 19 / 20 / 18 / 18             | 17.4 / 18.2 / 17.6 / 17.6
COSINE2 (82/91)         | 2a, 2m, 1fm, 3i, 3o | –                       | 23  | 23 / 23 / 23 / 23             | 21.2 / 21.2 / 21.2 / 21.2
Average                 |                     |                         | 18  | 18.2 / 19.3 / 20.5 / 18.5     | 16.8 / 17.0 / 16.9 / 17.1
RCS Experimental Results
Homogeneous RCS: all resources have unit delay. New benchmarks (compared to the last slide) are too large for ILP
MMAS RCS: Results
Consistently generates better results over all test cases
Up to 23.8% better than the list scheduler
On average 6.4%, and up to 15%, better than force-directed scheduling
Quantitatively closer to the known optimal solutions
Idea: combine ACO and force-directed scheduling (FDS). Quick FDS review:
Uniformly distribute the operations onto the available resources
Operation probability: each operation is equally likely at any step within its [ASAP, ALAP] time frame
Distribution graph (DG): the sum of the operation probabilities at each step
Self force: the change in the DG from scheduling an operation; predecessor/successor forces: implicit effects on the DG
Schedule an operation to the step with the minimum force
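The operation probability and distribution graph can be sketched as follows. The `frames` representation, mapping each operation to its inclusive [ASAP, ALAP] range, is an assumption for illustration.

```python
def distribution_graph(frames, num_steps):
    """FDS distribution graph: each operation is uniformly probable
    over its [asap, alap] time frame; DG(j) sums those probabilities
    over all operations for each control step j."""
    dg = [0.0] * num_steps
    for asap, alap in frames.values():
        p = 1.0 / (alap - asap + 1)        # uniform operation probability
        for j in range(asap, alap + 1):
            dg[j] += p
    return dg
```

Two operations with frames [0,1] and [0,0] yield DG = [1.5, 0.5]: step 0 is more crowded, so FDS forces push mobile operations toward step 1.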
MMAS TCS Formulation
ACO Formulation for TCS
Initialize pheromone model While (termination not satisfied)
Create ants Each ant finds a solution Evaluate solutions and update pheromone
Report the best result found
[Figure: example DFG (operations v1–v11, time steps 1–4) with pheromone trails τij from operations to time steps]
Trails τij indicate the favorableness of assigning instruction i to position j
ACO Formulation for TCS
Within the loop above, each ant selects an operation oph probabilistically, then selects its time step as follows:
Global heuristics: tied to the search experience (pheromones τhj)
Local heuristics: the inverse of the distribution graph, 1/qk(j)
  p(h, j) = τhj^α · (1/qk(j))^β / Σl τhl^α · (1/qk(l))^β
Here α and β are constants
ACO Formulation for TCS
After each iteration of the loop:
Pheromone evaporation
Reward good partial solutions based on solution quality
Final Version of MMAS-TCS
Effectiveness of MMAS-TCS
MMAS TCS: Results
MMAS TCS is more stable than FDS, especially when the problem is highly unconstrained
258 out of 263 test cases are equal to or better than the FDS results
16.4% fewer resources
Design Space Exploration
DSE challenges for the designer:
Ever-increasing design options
Closely related to NP-hard problems: resource allocation, scheduling
Conflicting objectives (speed, cost, power, …)
Increasing time-to-market pressure
Our Focus: Timing/Cost
Timing/cost tradeoffs: known application, known resource types, known operation-to-resource mapping
Question: find the optimal timing/cost tradeoff curve
The most commonly faced problem; fundamental to other design considerations
Common Strategies
Usually done in an ad hoc, experience-dependent way
Or by scanning the design space with resource-constrained (RCS) or time-constrained (TCS) scheduling
What’s the problem? RCS and TCS are dual problems. Can we effectively use information from one to guide the other?
Design Space Model
Key Observations
A feasible configuration C covers a beam starting from (tmin, C), where tmin is the RCS result for C
Optimal tradeoff curve L is monotonically non-increasing as deadline increases
Design Space Model
Theorem
If C is the optimal TCS result at time t1, then the RCS result t2 of C satisfies t2 ≤ t1.
More importantly, no configuration C′ with a smaller cost can produce an execution time within [t2, t1].
Theorem (continued)
What does it give us?
It implies that we can construct L:
Start from the rightmost t
Find the TCS solution C
Push it leftwards using the RCS solution of C
Do this iteratively (alternating between TCS and RCS)
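The iteration can be sketched with two solver callbacks. The interfaces (`tcs(deadline)` returning the cheapest feasible configuration, `rcs(config)` returning its minimum latency) and the toy library in the check are assumptions for illustration, not the paper's actual solvers.

```python
def explore_tradeoffs(t_max, t_min, tcs, rcs):
    """Build the tradeoff curve L via TCS/RCS duality: solve TCS at
    deadline t, then jump left to t2 = RCS(C); by the theorem no
    cheaper configuration exists anywhere in [t2, t]."""
    curve = []                   # (leftmost deadline, configuration) pairs
    t = t_max
    while t >= t_min:
        cfg = tcs(t)             # cheapest configuration meeting deadline t
        t2 = rcs(cfg)            # fastest this configuration can run
        curve.append((t2, cfg))  # cfg covers the whole beam [t2, t]
        t = t2 - 1               # resume exploration left of the beam
    return curve
```

Each TCS call can thus skip the whole beam [t2, t] in one step, which is the saving over exhaustive deadline-by-deadline scanning.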
DSE Using Time/Resource Duality
Experiments
Three DSE approaches:
FDS: exhaustively scanning for TCS
MMAS-TCS: exhaustively scanning for TCS
MMAS-D: the proposed method, leveraging duality
* Scanning means that we perform TCS at each deadline of interest
DSE: MMAS-D vs. FDS
Experimental Results
Algorithm Runtime
Real Design Complications
Heterogeneous mapping: one operation has many implementations, with different bit-widths (e.g. a 32-bit multiplier is good for both mul(24) and mul(32)) and different area and delay
Real technology libraries are extremely sophisticated: hard to estimate the final timing and total area
Sharing depends on the cost of multiplexers
Downstream tools may not generate what we expect: resource sharing, register sharing
Downstream tools break component boundaries: logic synthesis, placement, and routing
Resource Allocation and Scheduling
Scheduling cost function?
Homogeneous TCS: total number of components
Heterogeneous TCS: total area of the functional units
FPGA designs: LUTs, slices, BRAMs, …
ASIC designs: silicon area
Total area comes from: functional units, registers, multiplexers, interconnect
Towards Real World: Constraint Graph
A hierarchical directed graph
Nodes V: operations
Edges E(vi, vj, Tij): timing constraints
Timing constraints Ti,j(c,o): start-time dependencies, finish-time dependencies, chained dependencies
Constraint Graph: Examples
Operations a and b scheduled in the same cycle
Operation b scheduled exactly one cycle after the start of operation a
Operation b must start after operation a
Operation a starts at least two cycles after the start of operation b
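All four examples are difference constraints of the form s[b] − s[a] ≥ w, so a constraint graph can be checked and ASAP-scheduled with a Bellman-Ford-style relaxation. This sketch assumes edges are (a, b, w) triples; the encoding in the docstring is one standard way to express the slide's examples, and infeasible (positive-weight) cycles are reported as ill-posed constraints.

```python
def asap_from_constraints(ops, edges):
    """Earliest start times satisfying s[b] - s[a] >= w for every
    edge (a, b, w).  The example constraints map to:
      a, b in the same cycle        -> (a, b, 0) and (b, a, 0)
      b exactly one cycle after a   -> (a, b, 1) and (b, a, -1)
      b must start after a          -> (a, b, 1)
      a at least two cycles after b -> (b, a, 2)
    """
    start = {v: 0 for v in ops}
    for _ in range(len(ops) + 1):          # Bellman-Ford-style passes
        changed = False
        for a, b, w in edges:
            if start[a] + w > start[b]:    # relax: lift s[b] as needed
                start[b] = start[a] + w
                changed = True
        if not changed:
            return start                   # fixed point: all constraints hold
    raise ValueError("ill-posed timing constraints (positive cycle)")
```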
Pipelined Designs
Start a new task before the prior one has completed
Feedback constraints among nodes
Specific initiation interval
Improves throughput; requires more hardware
Operation Chaining
Two or more operations scheduled in the same clock cycle
Faster/larger components, shorter latency, saves registers
Chaining across clock edges
Speculative Execution
Problem Formulation
Constraint graph: nodes V are operations; edges E are data dependencies and timing constraints
Technology library Q: area, timing
Resource constraints
Desired clock period: C
Determine the start time of each operation and the allocation of each resource type
Resource-constrained scheduling; timing-constrained scheduling
MMAS CRAAS: Overview
Start with an initial result: use the fastest components, ASAP/ALAP, and resolve resource conflicts; meet the timing and resource constraints
MMAS then iteratively searches for optimal solutions
MMAS CRAAS: ASAP/ALAP
Iterative ASAP/ALAP
Handles loops/feedback in the constraint graph
Checks for ill-posed timing constraints
MMAS CRAAS: Initial Schedule
Resource conflicts: more resources than available are used in the ASAP results
Push operations forward to resolve them
MMAS CRAAS: Overview
Each individual ant constructs schedules:
Load the ASAP timing results
Update the mobility range and operation probabilities
Update the distribution graph
Probabilistically defer operations
Probabilistically select operations
Schedule operations using p(i,j,k)
Update the ASAP/ALAP results
MMAS CRAAS: Global and Local Heuristics
Local heuristics: favor smaller functional units and fewer registers for this operation; uniform probability among all compatible resources
Global heuristics: favor solutions with smaller area
MMAS CRAAS: Scheduling
Defer operations from this iteration; favor operations with many options
Schedule an operation
Update the ASAP schedules; update the global heuristics
MMAS CRAAS: Results
Implemented in a leading high-level synthesis framework
Leverages the HDL back-ends to collect results; the front-end parses C and performs optimizations; resource sharing and register sharing are done after scheduling
The existing algorithm: based on FDS/FDLS, refined for real designs; force-directed operation deferring; re-allocates resources and iterates until the area increases
Results overview: 3–15% smaller (optimizing area); 1–4% faster (optimizing latency)
MMAS CRAAS: Results
Hard to generate good results for control-dominated designs (158, 160, and 54)
Better resource allocation and sharing; the existing algorithm prematurely converges
Consistent with previous observations
Conclusions and Future Research
There is (was?) room for more work on fundamental algorithms; they make a difference on real designs
Ivory tower: most academics do not tackle real-world problems, e.g. constraint graphs with pipelining, speculation, and chaining; actual delay and area (muxes, interconnect, …)
Gripes: it is extremely hard to validate new algorithms against old ones (e.g. no open-source code for FDS!); backends (hooks into commercial tools à la Quartus); benchmarks?!