Winter-Spring 2001, Codesign of Embedded Systems
Co-Synthesis Algorithms: HW/SW Partitioning
Part of the HW/SW Codesign of Embedded Systems Course (CE 40-226)
Topics
- Introduction
- Preliminaries
- Hardware/Software Partitioning
- Distributed System Co-Synthesis
Topics
- Introduction
- A Classification
- Examples
  - Vulcan
  - Cosyma
Introduction to HW/SW Partitioning
- The first variety of co-synthesis applications
- Definition
  - A HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture
- Usually
  - Multiprocessor architecture = one CPU + some ASICs on the CPU bus
Introduction to HW/SW Partitioning (cont’d)
- A Terminology
  - Allocation
    - Synthesis methods that design the multiprocessor topology along with the PEs and the SW architecture
  - Scheduling
    - The process of assigning PE (CPU and/or ASIC) time to processes so that they get executed
Introduction to HW/SW Partitioning (cont’d)
- In most partitioning algorithms
  - Type of CPU is fixed and given
  - ASICs must be synthesized
    - What function to implement on each ASIC?
    - What characteristics should the implementation have?
  - Are single-rate synthesis problems
    - CDFG is the starting model
HW/SW Partitioning (cont’d)
- Normal use of architectural components
  - CPU performs less computationally-intensive functions
  - ASICs used to accelerate core functions
- Where to use?
  - High-performance applications
    - No CPU is fast enough for the operations
  - Low-cost applications
    - ASIC accelerators allow use of a much smaller, cheaper CPU
A Classification
- Criterion: Optimization Strategy
  - Trade-off between Performance and Cost
- Primal Approach
  - Performance is the primary goal
  - First, all functionality in ASICs. Progressively move more to CPU to reduce cost.
- Dual Approach
  - Cost is the primary goal
  - First, all functions in the CPU. Move operations to the ASIC to meet the performance goal.
A Classification (cont’d)
- Classification due to optimization strategy (cont’d)
  - Example co-synthesis systems
    - Vulcan (Stanford): Primal strategy
    - Cosyma (Braunschweig, Germany): Dual strategy
Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Vulcan
Partitioning Examples: Vulcan
- Gupta, De Micheli, Stanford University
- Primal approach
  1. All-HW initial implementation.
  2. Iteratively move functionality to CPU to reduce cost.
- System specification language
  - HardwareC
  - Is compiled into a flow graph
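Partitioning Examples: Vulcan (cont’d)
[Figure: two HardwareC fragments and their flow graphs.
 (a) HardwareC "x=a; y=b;": a nop node with two edges, each labeled 1, leading to the nodes x=a and y=b.
 (b) HardwareC "if (c>d) x=e; else y=f;": a cond node with an edge labeled c>d leading to x=e and an edge labeled c<=d leading to y=f.]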
Partitioning Examples: Vulcan (cont’d)
- Flow Graph Definition
  - A variation of a (single-rate) task graph
  - Nodes
    - Represent operations
    - Typically low-level operations: mult, add
  - Edges
    - Represent data dependencies
    - Each contains a Boolean condition under which the edge is traversed
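The definition above can be illustrated with a small data structure. This is a minimal sketch, not Vulcan's internal representation: the class name `FlowGraph`, its methods, and the use of Python callables for edge conditions are all assumptions made for illustration. It encodes the HardwareC fragment "if (c>d) x=e; else y=f;" as a cond node with two conditional edges.

```python
from typing import Callable, Dict, List, Tuple

Env = Dict[str, int]                 # variable name -> current value
Cond = Callable[[Env], bool]         # Boolean condition attached to an edge


class FlowGraph:
    """Hypothetical flow graph: nodes are operations, edges carry conditions."""

    def __init__(self) -> None:
        self.ops: Dict[str, str] = {}                  # node name -> operation label
        self.edges: List[Tuple[str, str, Cond]] = []   # (source, target, condition)

    def add_op(self, name: str, label: str) -> None:
        self.ops[name] = label

    def add_edge(self, src: str, dst: str, cond: Cond) -> None:
        self.edges.append((src, dst, cond))

    def enabled_successors(self, node: str, env: Env) -> List[str]:
        """Targets of the outgoing edges whose condition holds in env."""
        return [d for s, d, c in self.edges if s == node and c(env)]


# Flow graph for: if (c>d) x=e; else y=f;
g = FlowGraph()
g.add_op("cond", "cond")
g.add_op("x=e", "assign")
g.add_op("y=f", "assign")
g.add_edge("cond", "x=e", lambda env: env["c"] > env["d"])
g.add_edge("cond", "y=f", lambda env: env["c"] <= env["d"])
```

An unconditional edge is simply one whose condition is always true, for example `lambda env: True`.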
Partitioning Examples: Vulcan (cont’d)
- Flow Graph
  - is executed repeatedly at some rate
  - can have initiation-time constraints for each node: t(v_i) + l_ij ≤ t(v_j) ≤ t(v_i) + u_ij
  - can have rate constraints on each node: m_i ≤ R_i ≤ M_i
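A short sketch of how such constraints could be checked for a candidate schedule. It assumes the initiation-time constraint reads t(v_i) + l_ij ≤ t(v_j) ≤ t(v_i) + u_ij and the rate constraint reads m_i ≤ R_i ≤ M_i (the relation symbols are a reconstruction). The function names and the dictionary layout are hypothetical and not part of Vulcan.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]   # (v_i, v_j)


def timing_ok(t: Dict[str, float],
              lower: Dict[Pair, float],
              upper: Dict[Pair, float]) -> bool:
    """Check t(v_i) + l_ij <= t(v_j) <= t(v_i) + u_ij for every listed pair."""
    for (i, j), l_ij in lower.items():
        if t[j] < t[i] + l_ij:       # minimum separation violated
            return False
    for (i, j), u_ij in upper.items():
        if t[j] > t[i] + u_ij:       # maximum separation violated
            return False
    return True


def rate_ok(rate: float, m: float, M: float) -> bool:
    """Check m_i <= R_i <= M_i for one node."""
    return m <= rate <= M
```

For example, with initiation times `{"a": 0, "b": 3}`, a lower bound of 2 and an upper bound of 5 on the pair `("a", "b")` are both satisfied, while an upper bound of 2 is not.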
Partitioning Examples: Vulcan (cont’d)
- Vulcan Co-synthesis Algorithm
  - Partitioning quantum is a thread
    - Algorithm divides the flow graph into threads and allocates them
    - Thread boundary is determined by
      1. (always) a non-deterministic delay element, such as a wait for an external variable
      2. (on choice) other points of the flow graph
  - Target architecture
    - CPU + Co-processor (multiple ASICs)
Partitioning Examples: Vulcan (cont’d)
- Vulcan Co-synthesis algorithm (cont’d)
  - Allocation
    - Primal approach
  - Scheduling
    - is done by a scheduler on the target CPU
    - the scheduler is generated as part of the synthesis process
    - schedules all threads (both HW and SW threads)
    - cannot be static, due to the non-deterministic initiation times of some threads
Partitioning Examples: Vulcan (cont’d)
- Vulcan Co-synthesis algorithm (cont’d)
  - Cost estimation
    - SW implementation
      - Code size: relatively straightforward
      - Data size: biggest challenge. Vulcan puts some effort into finding bounds for each thread
    - HW implementation
      - ?
Partitioning Examples: Vulcan (cont’d)
- Vulcan Co-synthesis algorithm (cont’d)
  - Performance estimation
    - Both SW and HW implementations
      - From the flow graph and the basic execution times for the operators
Partitioning Examples: Vulcan (cont’d)
- Algorithm Details
  - Partitioning goal
    - Allocate each thread to one of two partitions
      - CPU set: S
      - Co-processor set: H
    - Required execution rate must be met, and total cost minimized
Partitioning Examples: Vulcan (cont’d)
- Algorithm Details (cont’d)
  - Algorithm steps
    1. Put all threads in the H set.
    2. Iteratively do:
       2.1. Move some operations to S.
            2.1.1. Select a group of operations to move to S.
            2.1.2. Check performance feasibility by computing the worst-case delay through the flow graph given the new thread times.
            2.1.3. Do the move, if feasible.
       2.2. Incrementally update the cost function to reflect the new partition.
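The steps above can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not Vulcan's implementation: threads are moved one at a time in the order given, the worst-case delay is taken as the longest path through an acyclic thread graph, contention between software threads on the single CPU is ignored, and the cost update of step 2.2 is reduced to a comment. All names (`primal_partition`, `worst_case_delay`, the dictionary arguments) are hypothetical. It uses `graphlib`, available from Python 3.9.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set, Tuple


def worst_case_delay(preds: Dict[str, Iterable[str]],
                     time: Dict[str, float]) -> float:
    """Longest path through the thread graph (preds: thread -> predecessors)."""
    finish: Dict[str, float] = {}
    for n in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds.get(n, ())), default=0.0)
        finish[n] = start + time[n]
    return max(finish.values(), default=0.0)


def primal_partition(threads: List[str],
                     preds: Dict[str, Iterable[str]],
                     hw_time: Dict[str, float],
                     sw_time: Dict[str, float],
                     deadline: float) -> Tuple[Set[str], Set[str]]:
    H: Set[str] = set(threads)       # step 1: everything starts in hardware
    S: Set[str] = set()
    for th in threads:               # step 2.1.1: candidate order is an assumption
        # Thread times if th were moved to software as well.
        trial = {t: sw_time[t] if (t in S or t == th) else hw_time[t]
                 for t in threads}
        if worst_case_delay(preds, trial) <= deadline:   # step 2.1.2
            H.remove(th)             # step 2.1.3: commit the move
            S.add(th)
            # step 2.2: a cost function would be updated incrementally here
    return H, S


# Example: a chain a -> b -> c with a deadline of 8 time units.
preds = {"b": ["a"], "c": ["b"]}
hw = {"a": 1, "b": 1, "c": 1}
sw = {"a": 4, "b": 2, "c": 5}
H, S = primal_partition(["a", "b", "c"], preds, hw, sw, deadline=8)
```

In this example, a and b can move to the CPU (delay 7), but moving c as well would give a delay of 11, so c stays in hardware.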
Partitioning Examples: Vulcan (cont’d)
- Algorithm Details (cont’d)
  - Vulcan cost function
      f(w) = c1·Sh(H) − c2·Ss(S) + c3·B − c4·P + c5·|m|
    - c1 ... c5: weight constants
    - Sh(), Ss(): size functions
    - B: bus utilization (<1)
    - P: processor utilization (<1)
    - m: total number of variables to be transferred between the CPU and the co-processor
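As a small worked example, the cost function can be written directly as a Python function. The signs follow the formula as given on the slide; the function name, the argument names, and the unit weights are assumptions chosen only for illustration.

```python
from typing import Sequence


def vulcan_cost(size_hw: float, size_sw: float,
                bus_util: float, cpu_util: float, n_vars: int,
                c: Sequence[float] = (1, 1, 1, 1, 1)) -> float:
    """f(w) = c1*Sh(H) - c2*Ss(S) + c3*B - c4*P + c5*|m|, signs as on the slide."""
    c1, c2, c3, c4, c5 = c
    return (c1 * size_hw - c2 * size_sw
            + c3 * bus_util - c4 * cpu_util
            + c5 * n_vars)


# Hypothetical numbers: Sh(H)=100, Ss(S)=20, B=0.5, P=0.25, |m|=4, all weights 1.
cost = vulcan_cost(100, 20, 0.5, 0.25, 4)   # 100 - 20 + 0.5 - 0.25 + 4 = 84.25
```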
Partitioning Examples: Vulcan (cont’d)
- Algorithm Details (cont’d)
  - Complementary notes
    - A heuristic to minimize communication
      - Once a thread is moved to S, its immediate successors are placed in the list for evaluation in the next iteration.
    - No backtracking
      - Once a thread is assigned to S, it remains there.
    - Experimental results
      - The designs produced are considerably faster than all-SW implementations, but much cheaper than all-HW designs.
Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Cosyma
Partitioning Examples: Cosyma
- Rolf Ernst, et al.: Technical University of Braunschweig, Germany
- Dual approach
  1. All-SW initial implementation.
  2. Iteratively move basic blocks to the ASIC accelerator to meet performance objective.
- System specification language
  - Cx
  - Is compiled into an ESG (Extended Syntax Graph)
    - ESG is much like a CDFG
Partitioning Examples: Cosyma (cont’d)
- Cosyma Co-synthesis Algorithm
  - Partitioning quantum is a Basic Block
    - A Basic Block is a branch-free block of program
  - Target Architecture
    - CPU + accelerator ASIC(s)
  - Scheduling
  - Allocation
  - Cost Estimation
  - Performance Estimation
  - Algorithm Details
Partitioning Examples: Cosyma (cont’d)
- Cosyma Co-synthesis Algorithm (cont’d)
  - Performance Estimation
    - SW implementation
      - Done by examining the object code for the basic block generated by a compiler
    - HW implementation
      - Assumes one operator per clock cycle.
      - Creates a list schedule for the DFG of the basic block.
      - Depth of the list gives the number of clock cycles required.
    - Communication
      - Done by data-flow analysis of the adjacent basic blocks.
      - In shared memory: proportional to the number of variables to be accessed
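The hardware estimate can be illustrated with a short sketch. It rests on one reading of the slide, stated here as an assumption: every operator takes one clock cycle, there is no limit on how many operators run in the same cycle, and the "depth" is the number of levels in an as-soon-as-possible schedule of the basic block's DFG. The function name `hw_cycles` and the DFG encoding are hypothetical. It uses `graphlib`, available from Python 3.9.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable


def hw_cycles(deps: Dict[str, Iterable[str]]) -> int:
    """Estimated clock cycles for a basic block.

    deps maps each operator to the operators whose results it consumes.
    Each operator is scheduled one cycle after its latest producer, so the
    result is the length of the longest dependency chain.
    """
    level: Dict[str, int] = {}
    for op in TopologicalSorter(deps).static_order():
        level[op] = 1 + max((level[p] for p in deps.get(op, ())), default=0)
    return max(level.values(), default=0)


# DFG for y = (a*b) + (c*d): both multiplications feed one addition.
dfg = {"mul1": [], "mul2": [], "add": ["mul1", "mul2"]}
cycles = hw_cycles(dfg)   # the two multiplications share cycle 1, the add is cycle 2
```

A resource-constrained list scheduler would give a larger value whenever fewer functional units are available than operators ready in the same cycle.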
Partitioning Examples: Cosyma (cont’d)
- Algorithm Steps
  - Change in execution time caused by moving basic block b from the CPU to the ASIC:
      c(b) = w · (t_HW(b) − t_SW(b) + t_com(Z) − t_com(Z ∪ b)) × It(b)
    - w: constant weight
    - t_HW(b), t_SW(b): execution time of basic block b in hardware and in software
    - t_com(Z): estimated communication time between the CPU and the accelerator ASIC, given a set Z of basic blocks implemented on the ASIC
    - It(b): total number of times that b is executed
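A minimal sketch of this cost increment in Python. The function name `delta_c`, the dictionary arguments, and the communication model in the example (two time units per block on the ASIC) are assumptions made for illustration; only the formula itself comes from the slide.

```python
from typing import Callable, Dict, FrozenSet


def delta_c(b: str,
            Z: FrozenSet[str],
            t_hw: Dict[str, float],
            t_sw: Dict[str, float],
            t_com: Callable[[FrozenSet[str]], float],
            iterations: Dict[str, int],
            w: float = 1.0) -> float:
    """c(b) = w * (t_HW(b) - t_SW(b) + t_com(Z) - t_com(Z U {b})) * It(b)."""
    return w * (t_hw[b] - t_sw[b] + t_com(Z) - t_com(Z | {b})) * iterations[b]


# Hypothetical communication model: 2 time units per block on the ASIC.
def com(Z: FrozenSet[str]) -> float:
    return 2.0 * len(Z)


change = delta_c("b1", frozenset(),
                 t_hw={"b1": 3}, t_sw={"b1": 10},
                 t_com=com, iterations={"b1": 100})
# (3 - 10 + 0 - 2) * 100 = -900
```

With the formula as written, a negative value means the estimated execution time goes down when b is moved to the ASIC.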
Partitioning Examples: Cosyma (cont’d)
- Experimental Results
  - By moving only basic blocks to HW
    - Typical speedup of only 2x
    - Reason: limited intra-basic-block parallelism
    - Cure: implement several control-flow optimizations to increase parallelism in the basic block, and hence in the ASIC
      - Examples: loop pipelining, speculative branch execution with multiple branch prediction, operator pipelining
    - Result
      - Speedups: 2.7 to 9.7
      - CPU times: 35 to 304 seconds on a typical workstation
What we learned today
- HW/SW Partitioning: One broad category of co-synthesis algorithms
- Criteria by which a co-synthesis algorithm is categorized