rtl and behavioral synthesis -...

1

244 Lecture Notes 11-1

1

RTL and BehavioralSynthesis

Srinivas Devadas

MIT

Kurt Keutzer

Richard Newton

UC Berkeley

2

An RTL description is always implicitly structural -

• the registers and their interconnectivity are defined

• Thus the clock-to-clock behavior is defined

• Only the control logic for the transfers is synthesized.

This approach can be enhanced:

• Register inferencing

• Automating resource allocation

+

+1

*

RTL implies

b

c

a

d

e2

RTL level

a = b + c;d = a + 1;e = d * 2;

Behavioral implies

e = 2 * (b + c + 1)

2


3

An behavioral description is always functional

• Temporal relationships are only expressed as precedences

• An entire architecture is synthesized from the behavioral description

The two key elements of behavioral :

• Automating resource allocation (could be RTL also)

• Schedule

+

+1

*

RTL implies

b

c

a

d

e2

Behavioral level

a = b + c;d = a + 1;e = d * 2;

Behavioral implies

e = 2 * (b + c + 1)

4

Two Steps:

1. Translation from RTL description into gates

2. Optimization of logic

Ideally, all of the intelligence should be in the optimization step

Unfortunately, starting point for optimization affects results

⇒ Need to write a “good” RTL description

RTL Synthesis

3


5

entity VHDL isport C (

A, B, C : in BIT ;Z : out BIT ;

) ;end VHDL ;

architecture VHDL_1 of VHDL isbegin

Z ⇐ (A and B) or C ; }end VHDL_1 ;

entity and port declarationsdefine interface

definesimplementation

logic expression

AB

C

VHDL

Z

6

process beginwait until CLOCK’ event and CLOCK = ‘1’ ;if (ENABLE = ‘1’) then

TOGGLE = not TOGGLE;end if;

end process;

wait infers a flip-flop

>

TOGGLEENABLE

CLOCK

Wait Statements

4


7

Current scenario: Designer iterates over RTL descriptions using simulation and synthesis tools to verify and evaluate design

PROBLEM: Equivalent RTL descriptions may result in dramatically different logic circuits in terms of area/performance EVEN AFTER OPTIMIZATION.

Writing RTL Descriptions

8

push multiplexors to inputs

_

+

a

b

SUB

result1

0

result

a

b

-b

SUB

Equivalent RTL Descriptions

if (SUB)result = a - b;

elseresult = a + b;

if (SUB)b = -b;

result = a + b;

+1

0

5


9

if (A) then {if (C) then

…}if (A) then { avoid

if (C) then { unnecessary} nested if statements

… }

AC

CA

False Paths

1

0 1

0

10

1. Increase power of optimization.

2. Define “good” input and restrict oneselfto writing “good” input descriptions.

3. Manipulate input descriptions to make them “good”.

Rules for writing descriptions based onknowledge that a library of descriptions are good starting points for synthesis.

Practical Solution

6


11

Languages versus Models

⊕

D D

D

D D

D

⊕ ⊕

⊕⊕ ⊕

⊕

⊕⊕ ⊕

⊕⊕

⊕

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕

⊕ ⊕⊗

⊗ ⊗ ⊗ ⊗

⊗

⊗ ⊗

D

Esterel MatlabVHDLUML“C, C++”

DiscreteEvent

SynchronousReactive

“FSMs” C forRTOS

FPGA-Based“Von Neumann”

ProcessorHard-Wired

ASIC RTOS

Language

Model ofComputation

ImplementationMedia

12


⊕

D D

D

D D

D

⊕ ⊕

⊕⊕ ⊕

⊕

⊕⊕ ⊕

⊕⊕

⊕

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕

⊕ ⊕⊗

⊗ ⊗ ⊗ ⊗

⊗

⊗ ⊗

D


DiscreteEvent

SynchronousReactive

“FSMs” C forRTOS


ProcessorHard-Wired

ASIC RTOS

7


13


⊕

D D

D

D D

D

⊕ ⊕

⊕⊕ ⊕

⊕

⊕⊕ ⊕

⊕⊕

⊕

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕

⊕ ⊕⊗

⊗ ⊗ ⊗ ⊗

⊗

⊗ ⊗

D


SynchronousReactive


ProcessorHard-Wired

ASIC RTOS

SLSMSLSMMLSMMLSM

(really just syntax)

14

Languages versus Models:A Software Analogy

f77

"C"

Lisp

Pascal

Von Neumann Computer

Intermediate Form

⊕

D DD

D D

D

⊕ ⊕

⊕⊕ ⊕

⊕

⊕⊕ ⊕

⊕⊕

⊕

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕

⊕ ⊕⊗

⊗ ⊗ ⊗ ⊗

⊗

⊗ ⊗

D

UDL-I VHDL

Silage

Hardware Implementation

Hardware Intermediate Form

8


15

Behavioral Synthesis Features

Scheduling of operations– Operation Chaining and Multi-cycling– Different Scheduling Modes, and constraints

Memory Inferencing– Array’s can be mapped directly to RAM– Automatic control/data/address generation– RAM tradeoffs (1 port vs. 2 port) are easy

Using Pipelined Components– Multipliers, Adders, RAMs, others

Pipelining Loop with various Throughput & Latency

Allocation of resources– Building your own design components

Automatic generation of FSM Controller and integration of Datapath

16

What is an Architecture?

m Amdahl, Blaauw, and Brooks, 1964, defined three interfaces:

Computer Architecture: "The attributes of a computer as seen by a machine language programmer."

Implementation:"Actual hardware structure, logic design, anddatapath organization."

Realization: "Encompasses the logic technologies, packaging, and interconnection."

9


17


m A description of the behavior of a system that is independent of its implementation.

Ô Isomorphic to its "interface specification" (Siewiorek, Bell, Newell, 1971)

For example:Ô Instruction set definition of a computerÔ Z-domain description of a filterÔ Handshaking protocol for a bus

m May guide implementation (contain 'hints' or 'pragmas')e.g. a particular specification may lead naturally to a serial or parallel

implementation.

18


Example: Instruction set

Inst_A Inst_B Inst_C ... Inst_C may not follow Inst_A

Inst_A Inst_B Inst_C ... Inst_C may not follow Inst_A because Inst_A uses a scratchpad register that Inst_C will over-write.

Architecture Not architecture

10


19

Specification vs. DescriptionSpecification: Saying what I want; describes behavior in terms of

results.e.g. ∀A { A[i,j] ← 0}

Description: Saying how to do it; describes behavior in terms of procedure or process.e.g. for(i=0; i<N; i++)

for(j=0; j<M; j++)A[i][j] = 0;

We do not have specification languages for general-purpose digital design. For some special-purpose applications (e.g. DSP) we do.

20

Level of Acceptance of SynthesisTechniques

Source: A. DeGeus

S pe e d o f De s igne r

Expe rt de s igne r

Behavio ral Synthe s is

Ac ce ptanc e c urve

Quality o f De s ign

Sequentia l Synthes is

Co mbinatio nal Sy nthe s is

11


21

Tradeoff Example - Echo Canceler

22

Echo Canceler - Tap

Total:6 Taps

12


23

Tap - Complex Multiplication

16x2->18multiplication

17x17->17addition

17x17->17subtraction

Total: 6 complex multiplications

24

Echo Canceler Complexity

Clock cycle : 30ns

Complexity : around 10K to 20K gates

Modules : 16x2 multiplier : 22ns 713 gates

17x17 add/sub : 20ns 193 gates

Source : ~2000 VHDL excl. comments

operations handled by BC operations handled by DC

addition : 30 constant : 11subtraction : 20 logic oper. : 66multiplication : 24 word expand : 100compare oper. : 2 truncation : 124data select : 29 shift oper. : 62delay elements : 24Total : 492

13


25

Echo Canceler Source Codeuse work.COSSAP_PACKAGE_SYNOPSYS.all ;

. . . . . . . . . port ( IN_R : in STD_LOGIC_VECTOR((WORDLENGTHIN -1) downto 0) ;. . . . . . . . . . . . . .

architecture behavior of ec is. . . . . . . . . . . .ot := (in2 and mask) or (in1 and not mask) ;delay_line_M_6_7_7 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_7_8 := (others => (others => '0'));delay_line_M_6_7_5 := (others => (others => '0'));delay_line_M_6_7_12 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_6_7 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_6_8 := (others => (others => '0'));delay_line_M_6_6_5 := (others => (others => '0'));

delay_line_M_6_11_12 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_14_7 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_14_8 := (others => (others => '0'));delay_line_M_6_14_5 := (others => (others => '0'));delay_line_M_6_14_12 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));wait untilCLOCK'event and CLOCK='1';

M_6_7_11_11_SIG_0010 := SIGNED(M_6_7_11_11_SIG_0008) + SIGNED(M_6_7_11_11_SIG_0009);

M_6_7_11_12_SIG_0010 := SIGNED(M_6_7_11_12_SIG_0009) -SIGNED(M_6_7_11_12_SIG_0008);



M_6_7_6_SIG_0006 := signed(M_6_7_6_SIG_0017) * signed(M_6_7_6_SIG_0024);M_6_7_6_SIG_0009 := signed(M_6_7_6_SIG_0022) * signed(M_6_7_6_SIG_0019);M_6_7_6_SIG_0016 := signed(M_6_7_6_SIG_0018_1) * signed(M_6_7_6_SIG_0020);M_6_7_6_SIG_0013 := signed(M_6_7_6_SIG_0021) * signed(M_6_7_6_SIG_0023);

delay_line_M_6_6_5 := M_6_6_SIG_0034 & delay_line_M_6_6_5(1 to 1 -1);delay_line_M_6_6_5(1) := M_6_6_SIG_0034; delay_line_M_6_6_8 := M_6_6_SIG_0025 & delay_line_M_6_6_8(1 to 1 -1);delay_line_M_6_6_8(1) := M_6_6_SIG_0025;





M_6_6_6_SIG_0006 := signed(M_6_6_6_SIG_0017) * signed(M_6_6_6_SIG_0024);M_6_6_6_SIG_0009 := signed(M_6_6_6_SIG_0022) * signed(M_6_6_6_SIG_0019);M_6_6_6_SIG_0016 := signed(M_6_6_6_SIG_0018_1) * signed(M_6_6_6_SIG_0020);





M_6_6_6_SIG_0006 := signed(M_6_6_6_SIG_0017) * signed(M_6_6_6_SIG_0024);


M_6_6_6_SIG_0016 := signed(M_6_6_6_SIG_0018_1) * signed(M_6_6_6_SIG_0020);



















26

Scheduling Tradeoffs I

* *

<

Design #2 Design #3

Design Tradeoff with Scheduling

**

+

<

+ **

<+

**

<+

Design #1

clock 50ns2 fast multipliers

3 cycles

clock 50ns1 fast multipliers

4 cycles

clock 50ns2 slow multipliers

4 cycles

multi-cycling

14


27

Scheduling Tradeoffs II

Design #5

Tradeoff:

Clock speed vs Latency vs Resources

*

*

<+

clock 50ns1 pipeline multiplier

5 cycles

Design #4

**

<+

clock 100ns2 slow multipliers

2 cycles

chaining

2-stagepipelinedmultiplier

28

Scheduling Tradeoff Results

Tradeoff: Resources, Area &Number of cycles

# multiplier# adderTotal Area

4 8 16 24 32 40 48 # cycles

Area

10

42 2 1 1 1

15

7

4 52 2 1

15


29

Scheduling Modes

Cycle-Fixed Mode– In this mode, the exact cycle times of read and write

operations on signals are preserved. Operations on variables are free to move around.

Superstate-Fixed Mode– In this mode, the relative times of read and write

operations are preserved, but superstates(delimited by wait statements) may be stretched. Operations on variables are free to move around.

Free-floating Mode– In this mode, computations, read and write operations

may flow freely past wait statements

In all cases, user can provide constraints to control the scheduling process.

30

Approaches to Scheduling

(1) Exhaustive ApproachesÔ e.g. EXPL (Barbacci , 1962): exhaustive search, Hafer & Parker: Branch & bound

(2) As-Soon-As-Possible (ASAP)Ô Used by many pioneering systems (e.g. early CMUDA, MIMOLA, and Flamel )

16


31

As-Soon-As-Possible (ASAP) andAs-Late-As-Possible (ALAP) Scheduling

⊕⊗

⊗⊗ ⊗ ⊕

⊗ <

1

2

3

4

⊗

⊕

⊗⊗⊗

⊗⊕⊗

<

⊗

ASAP ALAP

32

Approaches to Scheduling(3) List Scheduling

Ô Keep list of operations available to be scheduled (i.e. whose predecessors have already been scheduled) and order them for scheduling using some priority function

• BUD - length of path from operation to end of enclosing block• Elf, ISYN - "urgency": length of shortest path from operation to

nearest local constraintÔ O(C NlogN) time complexity

(4) Force-Directed ApproachesÔ "Force" between operation and control-step proportional to the number of operations

of the same type that could go in that control stepÔ Scheduling to minimize force tends to minimize the number of resources needed

(e.g. HAL)

Ô O(C N^2) time complexity

17


33

Minimizing Resources UsingForce-Directed Techniques

⊕

⊗⊗⊗

⊗⊕

⊗<

1

2

3

4

⊗⊕

⊗⊗⊗

⊗⊕

⊗<

×± L

21 -

2- 1

21 -

- 2 -

×± L

21 -

21 -

11 -

- 1 1

34

Basic Force-Directed Algorithm

repeat {

(1) evaluate ASAP and ALAP limits for each operation;(2) create or update distribution graphs;(3) calculate the self-force for every unscheduled

operation and feasible control step;(4) add predecessor and successor forces to self-force;(5) schedule the operation with lowest force;

} until ( all operations are scheduled;)

18


35

5th-Order Wave-Digital Filter

⊕

D D

D

D D

D

⊕ ⊕

⊕⊕ ⊕

⊕

⊕⊕ ⊕

⊕⊕

⊕

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕

⊕ ⊕⊗

⊗ ⊗ ⊗ ⊗

⊗

⊗ ⊗

D

m 26 additions and 8 multiplies, as specified

36

5th Order Wave-Digital Filter Dataflow Graph

++

+

+

+

++

++

+

++

++

++

+

++

++

+

+

++

+

××

××

×

××

×

12111091 2 3 4 5 6 7 8 13 14 15 16 17 18

19


37

Results from List Scheduling

+++

+

+++

12

+++ +

+ ++

13

14

15

16

17

18

19

++

++

+++ +

++ +

+

× ×

× ×

××

×

×

Case 1 Case 2

38

ASAP / ALAP Schedules for Case 1

+++ ++

12

+++ +

+

+

+

+

++

13

14

15

16

17

13

14

15

16

17

DG: addDG: mult

1 2 3 4

Resources: add, mult

× × × ××

20


39

ASAP / ALAP Schedules for Case 2

+

+

++

12

13

14

15

16

17

18

+++

++

+

×++

×+

++

+

+

+

×++

13

14

15

16

17

18

DG: addDG: multResources: add, mult

1 2 3 4

×× ×

×

40

Representative Results for 5th Order WDF

Synthesis System Adders Multplrs Cycles

FDS(HAL) 2 2 19

APARTY(SAW), CMU 2 2 19

CADDY 2 2 18

FDLS(HAL) 2 2 18

<ESC> 2 2 18

HAL, CADDY, Nische 2 1p 19

SPAID, <ESC>

21


41

Memory Types, Embedded

Memory Types:

• Synchronous : Inputs & outputs occur at clock edge

• Asynchronous : Unclocked signals

• Single Port: Either read or write to one address location

• Dual/Multi Port: Two fully independent read/write ports

DualPort

Data In A

Data In B

Data Out A

Data Out B

Address A Address B

Enable A

SinglePort

Data In Data Out

Address

Enable Enable B

42

Memory Inferencing------ RAM Declaration Section -------type word is integer range 0 to 255;variable a : array N downto 1 of word;variable b : array M downto N + 1 of word;constant MY_RAM : resource := 0;attribute variables of MY_RAM is “a b”;attribute map_to_module of MY_RAM is “DW03_ram1_s_d”;

---- RAM Read and Write Example -------a(i) := x * y; -- writeSUM := a(i) + b(N+i); -- read

/***** RAM Declaration Section *****/reg [7:0] a[N:0], b[M:N+1];/* synopsys resource MY_RAM: variables = “a b”,

map_to_module = “DW03_ram1_s_d”; */

/***** RAM Read and Write Example *****/a[i] = x * y;SUM = a[i] + b[N+i];

Support for Single Port, Dual Port, Etc.

Several arrays can be mapped to the same RAM

22


43

Memory Tradeoff Example/***** RAM Declaration Section *****/reg [7:0] p[0:63], q[0:63];

/* synopsys resource MY_RAM: variables = “p”, map_to_module =

“DW03_ram1_s_d”; *//* synopsys resource MY_RAM: variables = “q”,

map_to_module = “DW03_ram1_s_d”; */. . . . . . . .

begin : countingforever begin/***** RAM Read and Write Example *****/incm_p : for (i=0; i<63; i=i-1) begin

q[i] = p[i] + 1; /* This will be unrolled */end

end end // counting

DualPort

Data In A

Data In B

Data Out A

Data Out B

Address A Address B

Enable A

SinglePort

Data In Data Out

Address

Enable Enable B

44

Memory Tradeoff Results

Tradeoff: RAM read/write ports, Area & Latency

1 2 3 4

# port in RAM

Latency

read 1cwrite 2c

read 1cwrite 2c

read 1cwrite 1c

read 1cwrite 1c

latency132

latency131

latency67

latency67

# R/W cycles

23


45

Loop PipeliningFor high-throughput designs

– Pipelined to start every two cycles, Operations overlapped.

– Latency remains the same. Throughput increases and input/output streams become more regular.

i +

1

j+

+

A(i) B(i)

Z(i)A(j) B(j)

Z(j)

i +

1

j+

+

A(i) B(i)

Z(i)

A(j) B(j)

Z(j)

i +

1

j

With Loop Pipelining…

…Significantly Higher Throughput

Without Loop Pipelining

46

Loop Pipelining Tradeoffs

**

<+

Without Loop pipelining:

Latency 4 cyclesThroughput 4 cycles

1 fast multiplier

**

<+ *

*

<+

**

<+

**

<+


2 fast multipliers


1 fast multiplier

24


47

Loop Pipelining Tradeoffs ResultsTradeoff: Resources, Initiation, Throughput & Area

# cycles

# multiplier# adderTotal Area

Area

4

910

16

throughputlatency

4 cycles8 cycles

8 cycles16 cycles

3

7

12 cycles24 cycles

16 cycles32 cycles

2

6

48

Pipelined Components• If faster multiplier required, use

pipelined multipliers from design library- Two-stage- Three-stage

• Benefit of using design library- Behavioral synthesisautomaticallygenerates control logic- Better area efficiency- Resource sharing

Write SourceCode

Check Timing(DC Output)

If Multiplier onCritical Path

• Pipeline Multiplier

• Modify Source Codeto Add Pipeline (States)

Change y = a * b toy = mult_2s(a,b,clk)

RTL synthesis

BehavioralSynthesis

Data 1

Data 1 Result 1 Data 2 Result 2

Clock

Multiplier

New Clock

PipelineMultiplier

P Res 1 Res 1Data 2

Data 3P Res 2 Res 2

Res 3P Res 3

25


49

Pipelined Tradeoff Results

Tradeoff: Resources, Area & Number of cycles

# pipeline multiplier

# adderTotal Area

4 8 16 24 32 40 48 # cycles

Area

12

42 2 1 1 1

16

10

7 64 3 5

50

Creating customized design library parts

Incremental Step for defining design library components for Behavioral Synthesis

Source Design(.vhd, .v, .db)

Synthesis Directives(loads, drives, etc)

designlibrary

developer

Semiconductor Vendor

Design LibraryComponents

Your CustomDesign LibraryComponents

SynopsysDesign LibraryComponents

Design LibraryInput

Library (.sl)

26


51

Identify the common expression

**+

+

**+

**+

**+

+ +

Implemented as design library component

function [15:0] sum_of_product;// map_to_operator dw_sop// return_port_name szinput [7:0] a, b, c, d;

beginsum_of_product = a * b + c * d;

end

**+

52

Creating the implementation

function sum_of_product (a, ..: input UNSIGNED)return UNSIGNED is

-- pragma map_to_operator dw_sop-- pragma return_port_name sz

begin........end;

Function Definition

Synthetic Operator

Synthetic Module

Implementations

dw_sop

Small Fast

dw_sop_mod .sl file

a

b*

c

d*

sz+

27


53

Other Features included

Bit-level timing of design library components– Does the operation fit within a clock?

– Do multiple operations fit within a clock?

When should the operations be performed– 3rd, 4th cycle?– How many operations in each clock cycle?

Register assignment– Where should temporary values be stored?– How long are the inputs, temporary values needed?

Is the mux cost more expensive than sharing a resource?

54

Summary of all Tradeoff Results

18 different design tradeoffs

in 3 days

4 8 16 24 32 40 48 Throughput

Area

with pipeline multiplier

with loop pipeliningwith different schedule

28


55

Behavioral Design BenefitsBenefits:

– more abstraction• specifying functionality instead of

implementation

– faster simulation– system level design

• latency, #cycles, #resources, clock period

• throughput and area correlation– better quality of result

• multi-cycle optimization

• optimization not limited by register boundaries

– automatic generation of FSM

Latency

Area

56

RTL synthesis is mature and synthesis from RTL descriptions is the most popular means of integrated circuit design

Behavioral synthesis is much less mature– Initial use in DSP design (constrained problems:

such as register allocation given a schedule)– Growing use in video (lots of arithmetic),

networking (complex schedules), and gradually, general ASIC design

Summary

29


57

SYNTHESIS REFERENCES

Overviews:– Giovanni De Micheli,Synthesis and Optimization of Digital

Circuits, McGraw-Hill 1994 – S. Devadas, A. Ghosh, K. Keutzer, Logic Synthesis, McGraw-

Hill 1994

Particularly for DSP:– S. Note et al. ``Combined Hardware Selection and Pipelining in

High-Level Performance Data-Path Design,’’ IEEE Transactions on CAD/ICAS, Vol. CAD 11, pp. 413-423, April 1992.

– J. Rabaey et al. ``Cathedral II: A Synthesis System for Multiprocessor DSP Systems,’’ in Silicon Compilation, Edited by Dan Gajski, Addison-Wesley 1988.

58

Extras

30


59

Manipulating and optimizing RTL descriptions while maintaining required functionality but changing FSM behavior

Behavioral transformations

– Parallelize, serialize or pipeline a behavioral description

– Allocate hardware; registers buses, ALUs to obtain a RTL description with a definite structure

Behavioral Synthesis

60

Elliptic-filter (In, Out)

{i = Ina = i + t2b = a + t13g = t33 + t39e = g + t26 + b

d = (m21 * e) + bf = (m24 * e) + g

… …

Out = o

}

Behavioral Specification

31


61

Treat the given description as a “dataflow” type specification.

Synthesize a datapath by first constructing a schedule of operations, that will require a certain amount of hardware, and then synthesize a controller given the datapathand associated schedule.

Control is viewed totally as a by-product ofdatapath synthesis.

Hardware Allocation

62

V3 = V1 + V2V5 = V3 - V4V7 = V3 * V6V8 = V3 + V5V9 = V1 + V7V11 = V10 / V5V13 = V3V12 = V1V14 = V11 and V8V15 = V12 or V9V1 = V14V2 = V15

Example Behavior

32


63

A schedule:

r3 = r1 + r2 r8 = r1r7 = r3 - r4r2 = r3 * r5r3 = r3 + r7r2 = r1 + r2r7 = r6 / r7r1 = r3 and r7r2 = r2 or r8

V1 to V15 merged into r1 to r8

Schedule - I

64

ALUr1 r2 r3 r4 r5 r6 r7 r8

1

2 3

4

5 6 7

8

9 10 11

12

13 14

15 16

r3 = r1 + r2 r8 = r1r7 = r3 - r4r2 = r3 * r5r3 = r3 + r7r2 = r1 + r2r7 = r6 / r7r1 = r3 and r7r2 = r2 or r8

Datapath - I

ALU performs +, -, *, /, and, or

33


65

Controller - I

Controller:

0 S0 S0 N0P 0000000000000000

1 S0 S1 N0P 0000000000000000

- S1 S2 ADD 1111010010000010

- S2 S3 SUB 1110000101000100

- S3 S4 MUL …

- S4 S5 ADD …

- S5 S6 ADD

- S6 S7 DIV

- S7 S8 AND

- S8 S0 OR …

66

r3 = r1 + r2 r8 = r1r5 = r3 - r4 r2 = r3 * r6r3 = r3 + r5 r5 = r5 * r7r2 = r1 and r2 r1 = r3 or r5r2 = r8 + r2

ALU1

1 2

3

and+-

4

6

5

ordiv*

r1 r2 r3 r4 r5 r6 r7 r8

7 8

9

10 11

15

13

14

12

16

17 18 19

20

21

2223

Schedule and Datapath - II

ALU2

34


67

Controller

0 S0 S0 NOP NOP 000000000000000000000001 S0 S1 NOP NOP 00000000000000000000000- S1 S2 ADD NOP 11100001000100100000010- S2 S3 SUB MUL … - S3 S4 ADD DIV … - S4 S5 AND OR- S5 S6 ADD NOP …

Controller - II

68

Arithmetic unit allocation: Decide the number of arithmetic units, as well as the grouping of arithmetic operators

e.g., ALU1: + , -ALU2: + , -, incrALU3: * , shift

or ALU1: + , - , incr , shiftALU2: *

Register allocation: Coalesce the variables into a minimum number of registers

Interconnect allocation: Allocate buses and links, to implement all the required data transfers between the registers and ALUs.

Allocation Problems

35


69

Call the X dimension (hardware) space and the Y dimension (execution) time

Minimum number of ALUs:– Maximum number of occupied space slots.

Each ALU performs operations in its slot.

Minimum number of registers:– Maximum density of variable lifetimes across

all the time slots.

Interconnections related to the number of parallel data transfers in each time slot.

SpaceTime

Z1 = X1 + Y1K1 = Z1 - X2L1 = Z2 or Z3

Z2 = X1 * Y2K2 = Z2 @ X1

Z3 = X2

K3 = K2 + 1

Placement Formulation

70

Given a code sequence, coalesce the variables into a minimum number of registers.

Define the lifetime of a variable as the interval between its first assignment and last use.

Merge variables with non-overlapping lifetimes

Lifetimes are open intervals, at bottom end.

Register Allocation

V1 = V2 + V3

V4 = V2 - V3

V5 = V1 * V2

V6 = V4 and V3

V7 = V5 or V6

V1 V2 V3 V4 V5 V6 V7

36


71

Maximum density of variable lifetimes across all timeslots = minimum # registers required

Merge variables into # registers = maximum density

Different schedules have different max. densities.

R1 = R2 + R3R4 = R2 - R3R1 = R1 * R2R4 = R4 and R3R4 = R1 or R4

(V1, V5) R1

(V4, V6, V7) R4

Register Allocation - II

V1 = V2 + V3

V4 = V2 - V3

V5 = V1 * V2

V6 = V4 and V3

V7 = V5 or V6

V1 V2 V3 V4 V5 V6 V7

V2 R2

V3 R3

rtl and behavioral synthesis -...

Documents