Co-Synthesis Algorithms: HW/SW Partitioning
Part of HW/SW Codesign of Embedded Systems Course (CE 40-226)
Winter-Spring 2001



Topics
- Introduction
- Preliminaries
- Hardware/Software Partitioning
- Distributed System Co-Synthesis


Topics
- Introduction
- A Classification
- Examples
  - Vulcan
  - Cosyma


Introduction to HW/SW Partitioning
- The first variety of co-synthesis applications
- Definition
  - A HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture
- Usually
  - Multiprocessor architecture = one CPU + some ASICs on the CPU bus


Introduction to HW/SW Partitioning (cont’d)
- Terminology
  - Allocation
    - Synthesis methods which design the multiprocessor topology along with the PEs and SW architecture
  - Scheduling
    - The process of assigning PE (CPU and/or ASIC) time to processes so that they get executed


Introduction to HW/SW Partitioning (cont’d)
- In most partitioning algorithms
  - Type of CPU is fixed and given
  - ASICs must be synthesized
    - What function to implement on each ASIC?
    - What characteristics should the implementation have?
- Most partitioning algorithms are single-rate synthesis problems
  - CDFG is the starting model


HW/SW Partitioning (cont’d)
- Normal use of architectural components
  - CPU performs less computationally-intensive functions
  - ASICs used to accelerate core functions
- Where to use?
  - High-performance applications
    - No CPU is fast enough for the operations
  - Low-cost applications
    - ASIC accelerators allow use of a much smaller, cheaper CPU


A Classification
- Criterion: Optimization Strategy
  - Trade-off between Performance and Cost
- Primal Approach
  - Performance is the primary goal
  - First, all functionality in ASICs. Progressively move more to CPU to reduce cost.
- Dual Approach
  - Cost is the primary goal
  - First, all functions in the CPU. Move operations to the ASIC to meet the performance goal.
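The two strategies can be contrasted with a small sketch. Everything here is an illustrative assumption and not taken from any real co-synthesis tool: the function names, the per-function `hw_cost`, `hw_time` and `sw_time` figures, the greedy orderings, and the simplification that total time is a plain sum with no communication overhead.

```python
# Toy model (assumption): each function has a HW area cost, a HW time and a SW time.
FUNCS = {
    "fir":  {"hw_cost": 8, "hw_time": 2, "sw_time": 10},
    "fft":  {"hw_cost": 9, "hw_time": 3, "sw_time": 20},
    "ctrl": {"hw_cost": 4, "hw_time": 1, "sw_time": 2},
}

def total_time(hw):
    """Sum of execution times; functions in `hw` run on the ASIC, the rest on the CPU."""
    return sum(f["hw_time"] if n in hw else f["sw_time"] for n, f in FUNCS.items())

def total_cost(hw):
    """Hardware cost of the functions placed on the ASIC."""
    return sum(FUNCS[n]["hw_cost"] for n in hw)

def primal(deadline):
    """Start all-HW; move functions to the CPU while the deadline still holds."""
    hw = set(FUNCS)
    # Try to evict the most expensive hardware first (hypothetical greedy order).
    for name in sorted(FUNCS, key=lambda n: -FUNCS[n]["hw_cost"]):
        if total_time(hw - {name}) <= deadline:
            hw.remove(name)
    return hw

def dual(deadline):
    """Start all-SW; move functions to the ASIC until the deadline is met."""
    hw = set()
    # Move the function with the largest time saving first (hypothetical order).
    for name in sorted(FUNCS, key=lambda n: FUNCS[n]["hw_time"] - FUNCS[n]["sw_time"]):
        if total_time(hw) <= deadline:
            break
        hw.add(name)
    return hw
```

On this toy data both loops end with only `fft` in hardware for a deadline of 15, but they reach that point from opposite ends, which is the distinction the classification draws.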


A Classification (cont’d)
- Classification due to optimization strategy (cont’d)
  - Example co-synthesis systems
    - Vulcan (Stanford): Primal strategy
    - Cosyma (Braunschweig, Germany): Dual strategy



HW/SW Partitioning Examples: Vulcan


Partitioning Examples: Vulcan
- Gupta, De Micheli, Stanford University
- Primal approach
  1. All-HW initial implementation.
  2. Iteratively move functionality to CPU to reduce cost.
- System specification language
  - HardwareC
  - Is compiled into a flow graph


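Partitioning Examples: Vulcan (cont’d)
[Figure: two HardwareC fragments and their flow graphs. "x=a; y=b;" becomes a nop node with two outgoing edges, each labeled 1, to the nodes x=a and y=b. "if (c>d) x=e; else y=f;" becomes a cond node with an edge labeled c>d to the node x=e and an edge labeled c<=d to the node y=f.]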


Partitioning Examples: Vulcan (cont’d)
- Flow Graph Definition
  - A variation of a (single-rate) task graph
  - Nodes
    - Represent operations
    - Typically low-level operations: mult, add
  - Edges
    - Represent data dependencies
    - Each contains a Boolean condition under which the edge is traversed
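A minimal sketch of such a graph, assuming a dictionary-based representation in which each edge carries a Boolean condition as a Python callable. The class and method names (`FlowGraph`, `enabled_successors`) are hypothetical and do not come from Vulcan; the example graph models the fragment `if (c>d) x=e; else y=f;`.

```python
from dataclasses import dataclass, field

@dataclass
class FlowGraph:
    # node name -> operation label
    ops: dict = field(default_factory=dict)
    # (src, dst) -> Boolean condition evaluated against the current variable values
    edges: dict = field(default_factory=dict)

    def add_node(self, name, op):
        self.ops[name] = op

    def add_edge(self, src, dst, cond=lambda env: True):
        # An unconditional edge is modeled as a condition that is always true.
        self.edges[(src, dst)] = cond

    def enabled_successors(self, node, env):
        """Successors of `node` whose edge condition holds for the values in `env`."""
        return [d for (s, d), c in self.edges.items() if s == node and c(env)]

# Flow graph for: if (c > d) x = e; else y = f;
g = FlowGraph()
g.add_node("cond", "cond")
g.add_node("then", "x=e")
g.add_node("else", "y=f")
g.add_edge("cond", "then", lambda env: env["c"] > env["d"])
g.add_edge("cond", "else", lambda env: env["c"] <= env["d"])
```

Calling `g.enabled_successors("cond", {"c": 3, "d": 1})` follows only the edge whose condition `c > d` holds.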


Partitioning Examples: Vulcan (cont’d)
- Flow Graph
  - is executed repeatedly at some rate
  - can have initiation-time constraints for each node: $t(v_i) + l_{ij} \le t(v_j) \le t(v_i) + u_{ij}$
  - can have rate constraints on each node: $m_i \le R_i \le M_i$
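Both kinds of constraint reduce to simple interval checks. The sketch below assumes the initiation-time constraint between nodes i and j has the form t(v_i) + l_ij <= t(v_j) <= t(v_i) + u_ij and the rate constraint has the form m_i <= R_i <= M_i. The function names and the dictionary layout are hypothetical.

```python
def check_initiation(t, constraints):
    """t: node -> initiation time; constraints: (i, j) -> (l_ij, u_ij).
    Returns the node pairs that violate t[i] + l_ij <= t[j] <= t[i] + u_ij."""
    return [(i, j) for (i, j), (l, u) in constraints.items()
            if not (t[i] + l <= t[j] <= t[i] + u)]

def check_rates(rates, bounds):
    """rates: node -> R_i; bounds: node -> (m_i, M_i).
    Returns the nodes whose rate falls outside its bounds."""
    return [v for v, (lo, hi) in bounds.items() if not (lo <= rates[v] <= hi)]

# Hypothetical schedule: node "b" starts 5 time units after node "a".
times = {"a": 0, "b": 5}
violations = check_initiation(times, {("a", "b"): (2, 4)})
```

Here `violations` lists the pair `("a", "b")`, because 5 exceeds the upper bound of 0 + 4.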


Partitioning Examples: Vulcan (cont’d)
- Vulcan Co-synthesis Algorithm
  - Partitioning quantum is a thread
    - Algorithm divides the flow graph into threads and allocates them
    - Thread boundary is determined by
      1. (always) a non-deterministic delay element, such as a wait for an external variable
      2. (on choice) other points of the flow graph
  - Target architecture
    - CPU + Co-processor (multiple ASICs)


Partitioning Examples: Vulcan (cont’d)
- Vulcan Co-synthesis algorithm (cont’d)
  - Allocation
    - Primal approach
  - Scheduling
    - is done by a scheduler on the target CPU
    - the scheduler is generated as part of the synthesis process
    - the scheduler schedules all threads (both HW and SW threads)
    - the schedule cannot be static, because some threads have non-deterministic initiation times



- Cost estimation
  - SW implementation
    - Code size: relatively straightforward
    - Data size: biggest challenge. Vulcan puts some effort into finding bounds for each thread
  - HW implementation
    - ?



- Performance estimation
  - Both SW and HW implementation
    - From the flow graph and basic execution times for the operators


Partitioning Examples: Vulcan (cont’d)
- Algorithm Details
  - Partitioning goal
    - Allocate each thread to one of two partitions
      - CPU set: S
      - Co-processor set: H
    - Required execution rate must be met, and total cost minimized


Partitioning Examples: Vulcan (cont’d)
- Algorithm Details (cont’d)
  - Algorithm steps
    1. Put all threads in H set
    2. Iteratively do
       2.1. Move some operations to S.
            2.1.1. Select a group of operations to move to S.
            2.1.2. Check performance feasibility, by computing worst-case delay through the flow graph given the new thread times
            2.1.3. Do the move, if feasible
       2.2. Incrementally update the new cost function to reflect the new partition
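The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not Vulcan's implementation: the "group" moved in each iteration is a single thread, threads are visited in a fixed order, the thread graph is acyclic, worst-case delay is the longest path through it, and the cost update of step 2.2 is left out. All names and timing figures are hypothetical.

```python
from functools import lru_cache

def worst_case_delay(succ, time):
    """Longest path (sum of thread times) through an acyclic thread graph."""
    @lru_cache(maxsize=None)
    def longest_from(v):
        return time[v] + max((longest_from(s) for s in succ.get(v, ())), default=0)
    return max(longest_from(v) for v in time)

def partition(threads, succ, hw_time, sw_time, max_delay):
    H, S = set(threads), set()            # step 1: everything starts in hardware
    for t in threads:                     # step 2: one candidate thread per iteration
        trial_S = S | {t}                 # step 2.1.1: tentative move of t to software
        time = {v: sw_time[v] if v in trial_S else hw_time[v] for v in threads}
        if worst_case_delay(succ, time) <= max_delay:   # step 2.1.2: feasibility
            H.discard(t)                  # step 2.1.3: commit the move
            S = trial_S
    return H, S

# Hypothetical example: thread "a" feeds threads "b" and "c".
threads = ["a", "b", "c"]
succ = {"a": ["b", "c"]}
hw_time = {"a": 1, "b": 1, "c": 1}
sw_time = {"a": 4, "b": 5, "c": 2}
H, S = partition(threads, succ, hw_time, sw_time, max_delay=6)
```

In this example `a` and `c` end up in S, while `b` stays in H because moving it would stretch the path a, b to 9 time units.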



- Vulcan cost function
  - $f(w) = c_1 S_h(H) - c_2 S_s(S) + c_3 B - c_4 P + c_5 |m|$
  - $c_i$: weight constants
  - $S_h()$, $S_s()$: size functions
  - $B$: bus utilization (<1)
  - $P$: processor utilization (<1)
  - $m$: total number of variables to be transferred between the CPU and the co-processor
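A worked evaluation of this cost function may help. The sketch below transcribes f = c1*Sh(H) - c2*Ss(S) + c3*B - c4*P + c5*|m| directly. The function name, the argument names and all numbers are made-up assumptions, and the size functions are represented by precomputed values instead of being modeled.

```python
def vulcan_cost(c, hw_size, sw_size, bus_util, cpu_util, n_transfers):
    """Evaluate f = c1*Sh(H) - c2*Ss(S) + c3*B - c4*P + c5*|m| with c = (c1, ..., c5)."""
    c1, c2, c3, c4, c5 = c
    return (c1 * hw_size - c2 * sw_size
            + c3 * bus_util - c4 * cpu_util
            + c5 * n_transfers)

weights = (1, 1, 1, 1, 1)   # hypothetical weights, all equal
# Before a move: larger hardware, smaller software, 3 transferred variables.
before = vulcan_cost(weights, hw_size=10, sw_size=4, bus_util=0.5, cpu_util=0.5, n_transfers=3)
# After moving one thread to the CPU: hardware shrinks, software and transfers grow.
after = vulcan_cost(weights, hw_size=7, sw_size=6, bus_util=0.5, cpu_util=0.5, n_transfers=4)
```

With these numbers the value drops from 9.0 to 5.0, so under these weights the move lowers f even though one more variable crosses the CPU/co-processor boundary.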



- Complementary notes
  - A heuristic to minimize communication
    - Once a thread is moved to S, its immediate successors are placed in the list for evaluation in the next iteration.
  - No back-track
    - Once a thread is assigned to S, it remains there
  - Experimental results
    - The resulting designs are considerably faster than all-SW implementations, but much cheaper than all-HW designs



HW/SW Partitioning Examples: Cosyma


Partitioning Examples: Cosyma
- Rolf Ernst, et al.: Technical University of Braunschweig, Germany
- Dual approach
  1. All-SW initial implementation.
  2. Iteratively move basic blocks to the ASIC accelerator to meet performance objective.
- System specification language
  - Cx
  - Is compiled into an ESG (Extended Syntax Graph)
    - ESG is much like a CDFG


Partitioning Examples: Cosyma (cont’d)
- Cosyma Co-synthesis Algorithm
  - Partitioning quantum is a basic block
    - A basic block is a branch-free block of program code
  - Target Architecture
    - CPU + accelerator ASIC(s)
  - Scheduling
  - Allocation
  - Cost Estimation
  - Performance Estimation
  - Algorithm Details


Partitioning Examples: Cosyma (cont’d)
- Cosyma Co-synthesis Algorithm (cont’d)
  - Performance Estimation
    - SW implementation
      - Done by examining the object code for the basic block generated by a compiler
    - HW implementation
      - Assumes one operator per clock cycle.
      - Creates a list schedule for the DFG of the basic block.
      - Depth of the list gives the number of clock cycles required.
    - Communication
      - Done by data-flow analysis of the adjacent basic blocks.
      - In Shared-Memory
        - Proportional to number of variables to be accessed
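The HW estimate can be illustrated with a small list scheduler. This sketch reads "one operator per clock cycle" as "every operator takes exactly one cycle", which is an interpretation and not something the slides spell out. The optional `units` limit on operators issued per cycle, the function name, and the example DFG are likewise assumptions.

```python
def schedule_depth(deps, units=None):
    """deps: op -> list of ops it depends on; every op takes one cycle (assumption).
    units: max ops issued per cycle, or None for unlimited.
    Returns the number of clock cycles in the resulting list schedule."""
    done, cycles = set(), 0
    remaining = list(deps)
    while remaining:
        # Ready list: ops whose predecessors all finished in an earlier cycle.
        ready = [op for op in remaining if all(p in done for p in deps[op])]
        if not ready:
            raise ValueError("dependency cycle or unknown predecessor in DFG")
        issued = ready if units is None else ready[:units]
        done.update(issued)
        remaining = [op for op in remaining if op not in done]
        cycles += 1
    return cycles

# Hypothetical DFG for y = (a * b) + (c * d): two multiplies feed one add.
dfg = {"m1": [], "m2": [], "add": ["m1", "m2"]}
parallel_cycles = schedule_depth(dfg)          # both multiplies in the same cycle
serial_cycles = schedule_depth(dfg, units=1)   # one operator issued per cycle
```

The difference between the two results shows how much the estimate depends on the parallelism available inside a basic block.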


Partitioning Examples: Cosyma (cont’d)
- Algorithm Steps
  - Change in execution time caused by moving basic block b from CPU to ASIC:
    - $\Delta c(b) = w \cdot \big( t_{HW}(b) - t_{SW}(b) + t_{com}(Z) - t_{com}(Z \cup b) \big) \times It(b)$
  - $w$: constant weight
  - $t_{HW}(b)$, $t_{SW}(b)$: execution time of basic block b in hardware and in software
  - $t_{com}(Z)$: estimated communication time between the CPU and the accelerator ASIC, given a set Z of basic blocks implemented on the ASIC
  - $It(b)$: total number of times that b is executed
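The expression can be evaluated directly once the estimates are available. The sketch below transcribes w * (t_HW(b) - t_SW(b) + t_com(Z) - t_com(Z union {b})) * It(b) as given. The function name, the block name `loop`, all timing numbers and the communication model (two time units per block on the ASIC) are hypothetical.

```python
def delta_c(b, Z, t_hw, t_sw, t_com, iterations, w=1.0):
    """w * (t_HW(b) - t_SW(b) + t_com(Z) - t_com(Z union {b})) * It(b)."""
    return w * (t_hw[b] - t_sw[b] + t_com(Z) - t_com(Z | {b})) * iterations[b]

# Hypothetical estimates for a single basic block named "loop".
t_hw = {"loop": 3}          # cycles per execution on the ASIC
t_sw = {"loop": 10}         # cycles per execution on the CPU
iterations = {"loop": 100}  # It(b): how often the block runs

def t_com(Z):
    # Made-up communication model: two time units per block on the ASIC.
    return 2 * len(Z)

change = delta_c("loop", frozenset(), t_hw, t_sw, t_com, iterations)
```

The factor It(b) means a block executed many times dominates the result, which is why frequently executed blocks stand out as candidates for the ASIC.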


Partitioning Examples: Cosyma (cont’d)
- Experimental Results
  - By moving only basic blocks to HW
    - Typical speedup of only 2x
    - Reason:
      - Limited intra-basic-block parallelism
    - Cure:
      - Implement several control-flow optimizations to increase parallelism in the basic block, and hence in the ASIC
      - Examples: loop pipelining, speculative branch execution with multiple branch prediction, operator pipelining
    - Result:
      - Speedups: 2.7 to 9.7
      - CPU times: 35 to 304 seconds on a typical workstation


What we learned today
- HW/SW Partitioning: One broad category of co-synthesis algorithms
- Criteria by which a co-synthesis algorithm is categorized
