rtl and behavioral synthesis -...

36
1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas Devadas MIT Kurt Keutzer Richard Newton UC Berkeley 2 An RTL description is always implicitly structural - the registers and their interconnectivity are defined Thus the clock-to-clock behavior is defined Only the control logic for the transfers is synthesized. This approach can be enhanced: Register inferencing Automating resource allocation + +1 * RTL implies b c a d e 2 RTL level a = b + c; d = a + 1; e = d * 2; Behavioral implies e = 2 * (b + c + 1)

Upload: vucong

Post on 29-Mar-2018

228 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

1

244 Lecture Notes 11-1

1

RTL and BehavioralSynthesis

Srinivas Devadas

MIT

Kurt Keutzer

Richard Newton

UC Berkeley

2

An RTL description is always implicitly structural -

• the registers and their interconnectivity are defined

• Thus the clock-to-clock behavior is defined

• Only the control logic for the transfers is synthesized.

This approach can be enhanced:

• Register inferencing

• Automating resource allocation

+

+1

*

RTL implies

b

c

a

d

e2

RTL level

a = b + c;d = a + 1;e = d * 2;

Behavioral implies

e = 2 * (b + c + 1)

Page 2: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

2

244 Lecture Notes 11-1

3

An behavioral description is always functional

• Temporal relationships are only expressed as precedences

• An entire architecture is synthesized from the behavioral description

The two key elements of behavioral :

• Automating resource allocation (could be RTL also)

• Schedule

+

+1

*

RTL implies

b

c

a

d

e2

Behavioral level

a = b + c;d = a + 1;e = d * 2;

Behavioral implies

e = 2 * (b + c + 1)

4

Two Steps:

1. Translation from RTL description into gates

2. Optimization of logic

Ideally, all of the intelligence should be in the optimization step

Unfortunately, starting point for optimization affects results

⇒ Need to write a “good” RTL description

RTL Synthesis

Page 3: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

3

244 Lecture Notes 11-1

5

entity VHDL isport C (

A, B, C : in BIT ;Z : out BIT ;

) ;end VHDL ;

architecture VHDL_1 of VHDL isbegin

Z ⇐ (A and B) or C ; }end VHDL_1 ;

entity and port declarationsdefine interface

definesimplementation

logic expression

AB

C

VHDL

Z

6

process beginwait until CLOCK’ event and CLOCK = ‘1’ ;if (ENABLE = ‘1’) then

TOGGLE = not TOGGLE;end if;

end process;

wait infers a flip-flop

>

TOGGLEENABLE

CLOCK

Wait Statements

Page 4: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

4

244 Lecture Notes 11-1

7

Current scenario: Designer iterates over RTL descriptions using simulation and synthesis tools to verify and evaluate design

PROBLEM: Equivalent RTL descriptions may result in dramatically different logic circuits in terms of area/performance EVEN AFTER OPTIMIZATION.

Writing RTL Descriptions

8

push multiplexors to inputs

_

+

a

b

SUB

result1

0

result

a

b

-b

SUB

Equivalent RTL Descriptions

if (SUB)result = a - b;

elseresult = a + b;

if (SUB)b = -b;

result = a + b;

+1

0

Page 5: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

5

244 Lecture Notes 11-1

9

if (A) then {if (C) then

…}if (A) then { avoid

if (C) then { unnecessary} nested if statements

… }

AC

CA

False Paths

1

0 1

0

10

1. Increase power of optimization.

2. Define “good” input and restrict oneselfto writing “good” input descriptions.

3. Manipulate input descriptions to make them “good”.

Rules for writing descriptions based onknowledge that a library of descriptions are good starting points for synthesis.

Practical Solution

Page 6: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

6

244 Lecture Notes 11-1

11

Languages versus Models

D D

D

D D

D

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕ ⊕⊗

⊗ ⊗ ⊗ ⊗

⊗ ⊗

D

Esterel MatlabVHDLUML“C, C++”

DiscreteEvent

SynchronousReactive

“FSMs” C forRTOS

FPGA-Based“Von Neumann”

ProcessorHard-Wired

ASIC RTOS

Language

Model ofComputation

ImplementationMedia

12

Languages versus Models

D D

D

D D

D

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕ ⊕⊗

⊗ ⊗ ⊗ ⊗

⊗ ⊗

D

Esterel MatlabVHDLUML“C, C++”

DiscreteEvent

SynchronousReactive

“FSMs” C forRTOS

FPGA-Based“Von Neumann”

ProcessorHard-Wired

ASIC RTOS

Page 7: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

7

244 Lecture Notes 11-1

13

Languages versus Models

D D

D

D D

D

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕ ⊕⊗

⊗ ⊗ ⊗ ⊗

⊗ ⊗

D

Esterel MatlabVHDLUML“C, C++”

SynchronousReactive

FPGA-Based“Von Neumann”

ProcessorHard-Wired

ASIC RTOS

SLSMSLSMMLSMMLSM

(really just syntax)

14

Languages versus Models:A Software Analogy

f77

"C"

Lisp

Pascal

Von Neumann Computer

Intermediate Form

D DD

D D

D

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕ ⊕⊗

⊗ ⊗ ⊗ ⊗

⊗ ⊗

D

UDL-I VHDL

Silage

Hardware Implementation

Hardware Intermediate Form

Page 8: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

8

244 Lecture Notes 11-1

15

Behavioral Synthesis Features

Scheduling of operations– Operation Chaining and Multi-cycling– Different Scheduling Modes, and constraints

Memory Inferencing– Array’s can be mapped directly to RAM– Automatic control/data/address generation– RAM tradeoffs (1 port vs. 2 port) are easy

Using Pipelined Components– Multipliers, Adders, RAMs, others

Pipelining Loop with various Throughput & Latency

Allocation of resources– Building your own design components

Automatic generation of FSM Controller and integration of Datapath

16

What is an Architecture?

m Amdahl, Blaauw, and Brooks, 1964, defined three interfaces:

Computer Architecture: "The attributes of a computer as seen by a machine language programmer."

Implementation:"Actual hardware structure, logic design, anddatapath organization."

Realization: "Encompasses the logic technologies, packaging, and interconnection."

Page 9: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

9

244 Lecture Notes 11-1

17

What is an Architecture?

m A description of the behavior of a system that is independent of its implementation.

Ô Isomorphic to its "interface specification" (Siewiorek, Bell, Newell, 1971)

For example:Ô Instruction set definition of a computerÔ Z-domain description of a filterÔ Handshaking protocol for a bus

m May guide implementation (contain 'hints' or 'pragmas')e.g. a particular specification may lead naturally to a serial or parallel

implementation.

18

What is an Architecture?

Example: Instruction set

Inst_A Inst_B Inst_C ... Inst_C may not follow Inst_A

Inst_A Inst_B Inst_C ... Inst_C may not follow Inst_A because Inst_A uses a scratchpad register that Inst_C will over-write.

Architecture Not architecture

Page 10: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

10

244 Lecture Notes 11-1

19

Specification vs. DescriptionSpecification: Saying what I want; describes behavior in terms of

results.e.g. ∀A { A[i,j] ← 0}

Description: Saying how to do it; describes behavior in terms of procedure or process.e.g. for(i=0; i<N; i++)

for(j=0; j<M; j++)A[i][j] = 0;

We do not have specification languages for general-purpose digital design. For some special-purpose applications (e.g. DSP) we do.

20

Level of Acceptance of SynthesisTechniques

Source: A. DeGeus

S pe e d o f De s igne r

Expe rt de s igne r

Behavio ral Synthe s is

Ac ce ptanc e c urve

Quality o f De s ign

Sequentia l Synthes is

Co mbinatio nal Sy nthe s is

Page 11: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

11

244 Lecture Notes 11-1

21

Tradeoff Example - Echo Canceler

22

Echo Canceler - Tap

Total:6 Taps

Page 12: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

12

244 Lecture Notes 11-1

23

Tap - Complex Multiplication

16x2->18multiplication

17x17->17addition

17x17->17subtraction

Total: 6 complex multiplications

24

Echo Canceler Complexity

Clock cycle : 30ns

Complexity : around 10K to 20K gates

Modules : 16x2 multiplier : 22ns 713 gates

17x17 add/sub : 20ns 193 gates

Source : ~2000 VHDL excl. comments

operations handled by BC operations handled by DC

addition : 30 constant : 11subtraction : 20 logic oper. : 66multiplication : 24 word expand : 100compare oper. : 2 truncation : 124data select : 29 shift oper. : 62delay elements : 24Total : 492

Page 13: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

13

244 Lecture Notes 11-1

25

Echo Canceler Source Codeuse work.COSSAP_PACKAGE_SYNOPSYS.all ;

. . . . . . . . . port ( IN_R : in STD_LOGIC_VECTOR((WORDLENGTHIN -1) downto 0) ;. . . . . . . . . . . . . .

architecture behavior of ec is. . . . . . . . . . . .ot := (in2 and mask) or (in1 and not mask) ;delay_line_M_6_7_7 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_7_8 := (others => (others => '0'));delay_line_M_6_7_5 := (others => (others => '0'));delay_line_M_6_7_12 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_6_7 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_6_8 := (others => (others => '0'));delay_line_M_6_6_5 := (others => (others => '0'));

delay_line_M_6_11_12 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_14_7 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_14_8 := (others => (others => '0'));delay_line_M_6_14_5 := (others => (others => '0'));delay_line_M_6_14_12 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));wait untilCLOCK'event and CLOCK='1';

M_6_7_11_11_SIG_0010 := SIGNED(M_6_7_11_11_SIG_0008) + SIGNED(M_6_7_11_11_SIG_0009);

M_6_7_11_12_SIG_0010 := SIGNED(M_6_7_11_12_SIG_0009) -SIGNED(M_6_7_11_12_SIG_0008);

M_6_7_9_11_SIG_0010 := SIGNED(M_6_7_9_11_SIG_0008) + SIGNED(M_6_7_9_11_SIG_0009);

M_6_7_9_12_SIG_0010 := SIGNED(M_6_7_9_12_SIG_0009) -SIGNED(M_6_7_9_12_SIG_0008);

M_6_7_6_SIG_0006 := signed(M_6_7_6_SIG_0017) * signed(M_6_7_6_SIG_0024);M_6_7_6_SIG_0009 := signed(M_6_7_6_SIG_0022) * signed(M_6_7_6_SIG_0019);M_6_7_6_SIG_0016 := signed(M_6_7_6_SIG_0018_1) * signed(M_6_7_6_SIG_0020);M_6_7_6_SIG_0013 := signed(M_6_7_6_SIG_0021) * signed(M_6_7_6_SIG_0023);

delay_line_M_6_6_5 := M_6_6_SIG_0034 & delay_line_M_6_6_5(1 to 1 -1);delay_line_M_6_6_5(1) := M_6_6_SIG_0034; delay_line_M_6_6_8 := M_6_6_SIG_0025 & delay_line_M_6_6_8(1 to 1 -1);delay_line_M_6_6_8(1) := M_6_6_SIG_0025;

M_6_6_11_11_SIG_0010 := SIGNED(M_6_6_11_11_SIG_0008) + SIGNED(M_6_6_11_11_SIG_0009);

M_6_6_11_12_SIG_0010 := SIGNED(M_6_6_11_12_SIG_0009) -SIGNED(M_6_6_11_12_SIG_0008);

M_6_6_9_11_SIG_0010 := SIGNED(M_6_6_9_11_SIG_0008) + SIGNED(M_6_6_9_11_SIG_0009);

M_6_6_9_12_SIG_0010 := SIGNED(M_6_6_9_12_SIG_0009) -SIGNED(M_6_6_9_12_SIG_0008);

M_6_6_6_SIG_0006 := signed(M_6_6_6_SIG_0017) * signed(M_6_6_6_SIG_0024);M_6_6_6_SIG_0009 := signed(M_6_6_6_SIG_0022) * signed(M_6_6_6_SIG_0019);M_6_6_6_SIG_0016 := signed(M_6_6_6_SIG_0018_1) * signed(M_6_6_6_SIG_0020);

M_6_6_11_11_SIG_0010 := SIGNED(M_6_6_11_11_SIG_0008) + SIGNED(M_6_6_11_11_SIG_0009);

M_6_6_11_12_SIG_0010 := SIGNED(M_6_6_11_12_SIG_0009) -SIGNED(M_6_6_11_12_SIG_0008);

M_6_6_9_11_SIG_0010 := SIGNED(M_6_6_9_11_SIG_0008) + SIGNED(M_6_6_9_11_SIG_0009);

M_6_6_9_12_SIG_0010 := SIGNED(M_6_6_9_12_SIG_0009) -SIGNED(M_6_6_9_12_SIG_0008);

M_6_6_6_SIG_0006 := signed(M_6_6_6_SIG_0017) * signed(M_6_6_6_SIG_0024);

M_6_6_6_SIG_0009 := signed(M_6_6_6_SIG_0022) * signed(M_6_6_6_SIG_0019);

M_6_6_6_SIG_0016 := signed(M_6_6_6_SIG_0018_1) * signed(M_6_6_6_SIG_0020);

M_6_6_6_SIG_0013 := signed(M_6_6_6_SIG_0021) * signed(M_6_6_6_SIG_0023);

delay_line_M_6_5_5 := M_6_5_SIG_0034 & delay_line_M_6_5_5(1 to 1 -1);delay_line_M_6_5_5(1) := M_6_5_SIG_0034; delay_line_M_6_5_8 := M_6_5_SIG_0025 & delay_line_M_6_5_8(1 to 1 -1);delay_line_M_6_5_8(1) := M_6_5_SIG_0025;

M_6_5_11_11_SIG_0010 := SIGNED(M_6_5_11_11_SIG_0008) + SIGNED(M_6_5_11_11_SIG_0009);

M_6_5_11_12_SIG_0010 := SIGNED(M_6_5_11_12_SIG_0009) -SIGNED(M_6_5_11_12_SIG_0008);

M_6_5_9_11_SIG_0010 := SIGNED(M_6_5_9_11_SIG_0008) + SIGNED(M_6_5_9_11_SIG_0009);

M_6_5_9_12_SIG_0010 := SIGNED(M_6_5_9_12_SIG_0009) -SIGNED(M_6_5_9_12_SIG_0008);

M_6_5_6_SIG_0006 := signed(M_6_5_6_SIG_0017) * signed(M_6_5_6_SIG_0024);

M_6_5_6_SIG_0009 := signed(M_6_5_6_SIG_0022) * signed(M_6_5_6_SIG_0019);

M_6_5_6_SIG_0016 := signed(M_6_5_6_SIG_0018_1) * signed(M_6_5_6_SIG_0020);

M_6_5_6_SIG_0013 := signed(M_6_5_6_SIG_0021) * signed(M_6_5_6_SIG_0023);

delay_line_M_6_4_5 := M_6_4_SIG_0034 & delay_line_M_6_4_5(1 to 1 -1);delay_line_M_6_4_5(1) := M_6_4_SIG_0034; delay_line_M_6_4_8 := M_6_4_SIG_0025 & delay_line_M_6_4_8(1 to 1 -1);delay_line_M_6_4_8(1) := M_6_4_SIG_0025;

M_6_4_11_11_SIG_0010 := SIGNED(M_6_4_11_11_SIG_0008) + SIGNED(M_6_4_11_11_SIG_0009);

M_6_4_11_12_SIG_0010 := SIGNED(M_6_4_11_12_SIG_0009) -SIGNED(M_6_4_11_12_SIG_0008);

M_6_4_9_11_SIG_0010 := SIGNED(M_6_4_9_11_SIG_0008) + SIGNED(M_6_4_9_11_SIG_0009);

M_6_4_9_12_SIG_0010 := SIGNED(M_6_4_9_12_SIG_0009) -SIGNED(M_6_4_9_12_SIG_0008);

M_6_4_6_SIG_0006 := signed(M_6_4_6_SIG_0017) * signed(M_6_4_6_SIG_0024);

M_6_4_6_SIG_0009 := signed(M_6_4_6_SIG_0022) * signed(M_6_4_6_SIG_0019);

M_6_4_6_SIG_0016 := signed(M_6_4_6_SIG_0018_1) * signed(M_6_4_6_SIG_0020);

26

Scheduling Tradeoffs I

* *

<

Design #2 Design #3

Design Tradeoff with Scheduling

**

+

<

+ **

<+

**

<+

Design #1

clock 50ns2 fast multipliers

3 cycles

clock 50ns1 fast multipliers

4 cycles

clock 50ns2 slow multipliers

4 cycles

multi-cycling

Page 14: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

14

244 Lecture Notes 11-1

27

Scheduling Tradeoffs II

Design #5

Tradeoff:

Clock speed vs Latency vs Resources

*

*

<+

clock 50ns1 pipeline multiplier

5 cycles

Design #4

**

<+

clock 100ns2 slow multipliers

2 cycles

chaining

2-stagepipelinedmultiplier

28

Scheduling Tradeoff Results

Tradeoff: Resources, Area &Number of cycles

# multiplier# adderTotal Area

4 8 16 24 32 40 48 # cycles

Area

10

42 2 1 1 1

15

7

4 52 2 1

Page 15: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

15

244 Lecture Notes 11-1

29

Scheduling Modes

Cycle-Fixed Mode– In this mode, the exact cycle times of read and write

operations on signals are preserved. Operations on variables are free to move around.

Superstate-Fixed Mode– In this mode, the relative times of read and write

operations are preserved, but superstates(delimited by wait statements) may be stretched. Operations on variables are free to move around.

Free-floating Mode– In this mode, computations, read and write operations

may flow freely past wait statements

In all cases, user can provide constraints to control the scheduling process.

30

Approaches to Scheduling

(1) Exhaustive ApproachesÔ e.g. EXPL (Barbacci , 1962): exhaustive search, Hafer & Parker: Branch & bound

(2) As-Soon-As-Possible (ASAP)Ô Used by many pioneering systems (e.g. early CMUDA, MIMOLA, and Flamel )

Page 16: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

16

244 Lecture Notes 11-1

31

As-Soon-As-Possible (ASAP) andAs-Late-As-Possible (ALAP) Scheduling

⊕⊗

⊗⊗ ⊗ ⊕

⊗ <

1

2

3

4

⊗⊗⊗

⊗⊕⊗

<

ASAP ALAP

32

Approaches to Scheduling(3) List Scheduling

Ô Keep list of operations available to be scheduled (i.e. whose predecessors have already been scheduled) and order them for scheduling using some priority function

• BUD - length of path from operation to end of enclosing block• Elf, ISYN - "urgency": length of shortest path from operation to

nearest local constraintÔ O(C NlogN) time complexity

(4) Force-Directed ApproachesÔ "Force" between operation and control-step proportional to the number of operations

of the same type that could go in that control stepÔ Scheduling to minimize force tends to minimize the number of resources needed

(e.g. HAL)

Ô O(C N^2) time complexity

Page 17: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

17

244 Lecture Notes 11-1

33

Minimizing Resources UsingForce-Directed Techniques

⊗⊗⊗

⊗⊕

⊗<

1

2

3

4

⊗⊕

⊗⊗⊗

⊗⊕

⊗<

×± L

21 -

2- 1

21 -

- 2 -

×± L

21 -

21 -

11 -

- 1 1

34

Basic Force-Directed Algorithm

repeat {

(1) evaluate ASAP and ALAP limits for each operation;(2) create or update distribution graphs;(3) calculate the self-force for every unscheduled

operation and feasible control step;(4) add predecessor and successor forces to self-force;(5) schedule the operation with lowest force;

} until ( all operations are scheduled;)

Page 18: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

18

244 Lecture Notes 11-1

35

5th-Order Wave-Digital Filter

D D

D

D D

D

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕ ⊕

⊕⊕ ⊕

⊕⊕ ⊕

⊕⊕

⊕ ⊕⊗

⊗ ⊗ ⊗ ⊗

⊗ ⊗

D

m 26 additions and 8 multiplies, as specified

36

5th Order Wave-Digital Filter Dataflow Graph

++

+

+

+

++

++

+

++

++

++

+

++

++

+

+

++

+

××

××

×

××

×

12111091 2 3 4 5 6 7 8 13 14 15 16 17 18

Page 19: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

19

244 Lecture Notes 11-1

37

Results from List Scheduling

+++

+

+++

12

+++ +

+ ++

13

14

15

16

17

18

19

++

++

+++ +

++ +

+

× ×

× ×

××

×

×

Case 1 Case 2

38

ASAP / ALAP Schedules for Case 1

+++ ++

12

+++ +

+

+

+

+

++

13

14

15

16

17

13

14

15

16

17

DG: addDG: mult

1 2 3 4

Resources: add, mult

× × × ××

Page 20: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

20

244 Lecture Notes 11-1

39

ASAP / ALAP Schedules for Case 2

+

+

++

12

13

14

15

16

17

18

+++

++

+

×++

×+

++

+

+

+

×++

13

14

15

16

17

18

DG: addDG: multResources: add, mult

1 2 3 4

×× ×

×

40

Representative Results for 5th Order WDF

Synthesis System Adders Multplrs Cycles

FDS(HAL) 2 2 19

APARTY(SAW), CMU 2 2 19

CADDY 2 2 18

FDLS(HAL) 2 2 18

<ESC> 2 2 18

HAL, CADDY, Nische 2 1p 19

SPAID, <ESC>

Page 21: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

21

244 Lecture Notes 11-1

41

Memory Types, Embedded

Memory Types:

• Synchronous : Inputs & outputs occur at clock edge

• Asynchronous : Unclocked signals

• Single Port: Either read or write to one address location

• Dual/Multi Port: Two fully independent read/write ports

DualPort

Data In A

Data In B

Data Out A

Data Out B

Address A Address B

Enable A

SinglePort

Data In Data Out

Address

Enable Enable B

42

Memory Inferencing------ RAM Declaration Section -------type word is integer range 0 to 255;variable a : array N downto 1 of word;variable b : array M downto N + 1 of word;constant MY_RAM : resource := 0;attribute variables of MY_RAM is “a b”;attribute map_to_module of MY_RAM is “DW03_ram1_s_d”;

---- RAM Read and Write Example -------a(i) := x * y; -- writeSUM := a(i) + b(N+i); -- read

/***** RAM Declaration Section *****/reg [7:0] a[N:0], b[M:N+1];/* synopsys resource MY_RAM: variables = “a b”,

map_to_module = “DW03_ram1_s_d”; */

/***** RAM Read and Write Example *****/a[i] = x * y;SUM = a[i] + b[N+i];

Support for Single Port, Dual Port, Etc.

Several arrays can be mapped to the same RAM

Page 22: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

22

244 Lecture Notes 11-1

43

Memory Tradeoff Example/***** RAM Declaration Section *****/reg [7:0] p[0:63], q[0:63];

/* synopsys resource MY_RAM: variables = “p”, map_to_module =

“DW03_ram1_s_d”; *//* synopsys resource MY_RAM: variables = “q”,

map_to_module = “DW03_ram1_s_d”; */. . . . . . . .

begin : countingforever begin/***** RAM Read and Write Example *****/incm_p : for (i=0; i<63; i=i-1) begin

q[i] = p[i] + 1; /* This will be unrolled */end

end end // counting

DualPort

Data In A

Data In B

Data Out A

Data Out B

Address A Address B

Enable A

SinglePort

Data In Data Out

Address

Enable Enable B

44

Memory Tradeoff Results

Tradeoff: RAM read/write ports, Area & Latency

1 2 3 4

# port in RAM

Latency

read 1cwrite 2c

read 1cwrite 2c

read 1cwrite 1c

read 1cwrite 1c

latency132

latency131

latency67

latency67

# R/W cycles

Page 23: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

23

244 Lecture Notes 11-1

45

Loop PipeliningFor high-throughput designs

– Pipelined to start every two cycles, Operations overlapped.

– Latency remains the same. Throughput increases and input/output streams become more regular.

i +

1

j+

+

A(i) B(i)

Z(i)A(j) B(j)

Z(j)

i +

1

j+

+

A(i) B(i)

Z(i)

A(j) B(j)

Z(j)

i +

1

j

With Loop Pipelining…

…Significantly Higher Throughput

Without Loop Pipelining

46

Loop Pipelining Tradeoffs

**

<+

Without Loop pipelining:

Latency 4 cyclesThroughput 4 cycles

1 fast multiplier

**

<+ *

*

<+

**

<+

**

<+

Latency 4 cyclesThroughput 1 cycles

2 fast multipliers

Latency 4 cyclesThroughput 2 cycles

1 fast multiplier

Page 24: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

24

244 Lecture Notes 11-1

47

Loop Pipelining Tradeoffs ResultsTradeoff: Resources, Initiation, Throughput & Area

# cycles

# multiplier# adderTotal Area

Area

4

910

16

throughputlatency

4 cycles8 cycles

8 cycles16 cycles

3

7

12 cycles24 cycles

16 cycles32 cycles

2

6

48

Pipelined Components• If faster multiplier required, use

pipelined multipliers from design library- Two-stage- Three-stage

• Benefit of using design library- Behavioral synthesisautomaticallygenerates control logic- Better area efficiency- Resource sharing

Write SourceCode

Check Timing(DC Output)

If Multiplier onCritical Path

• Pipeline Multiplier

• Modify Source Codeto Add Pipeline (States)

Change y = a * b toy = mult_2s(a,b,clk)

RTL synthesis

BehavioralSynthesis

Data 1

Data 1 Result 1 Data 2 Result 2

Clock

Multiplier

New Clock

PipelineMultiplier

P Res 1 Res 1Data 2

Data 3P Res 2 Res 2

Res 3P Res 3

Page 25: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

25

244 Lecture Notes 11-1

49

Pipelined Tradeoff Results

Tradeoff: Resources, Area & Number of cycles

# pipeline multiplier

# adderTotal Area

4 8 16 24 32 40 48 # cycles

Area

12

42 2 1 1 1

16

10

7 64 3 5

50

Creating customized design library parts

Incremental Step for defining design library components for Behavioral Synthesis

Source Design(.vhd, .v, .db)

Synthesis Directives(loads, drives, etc)

designlibrary

developer

Semiconductor Vendor

Design LibraryComponents

Your CustomDesign LibraryComponents

SynopsysDesign LibraryComponents

Design LibraryInput

Library (.sl)

Page 26: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

26

244 Lecture Notes 11-1

51

Identify the common expression

**+

+

**+

**+

**+

+ +

Implemented as design library component

function [15:0] sum_of_product;// map_to_operator dw_sop// return_port_name szinput [7:0] a, b, c, d;

beginsum_of_product = a * b + c * d;

end

**+

52

Creating the implementation

function sum_of_product (a, ..: input UNSIGNED)return UNSIGNED is

-- pragma map_to_operator dw_sop-- pragma return_port_name sz

begin........end;

Function Definition

Synthetic Operator

Synthetic Module

Implementations

dw_sop

Small Fast

dw_sop_mod .sl file

a

b*

c

d*

sz+

Page 27: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

27

244 Lecture Notes 11-1

53

Other Features included

Bit-level timing of design library components– Does the operation fit within a clock?

– Do multiple operations fit within a clock?

When should the operations be performed– 3rd, 4th cycle?– How many operations in each clock cycle?

Register assignment– Where should temporary values be stored?– How long are the inputs, temporary values needed?

Is the mux cost more expensive than sharing a resource?

54

Summary of all Tradeoff Results

18 different design tradeoffs

in 3 days

4 8 16 24 32 40 48 Throughput

Area

with pipeline multiplier

with loop pipeliningwith different schedule

Page 28: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

28

244 Lecture Notes 11-1

55

Behavioral Design BenefitsBenefits:

– more abstraction• specifying functionality instead of

implementation

– faster simulation– system level design

• latency, #cycles, #resources, clock period

• throughput and area correlation– better quality of result

• multi-cycle optimization

• optimization not limited by register boundaries

– automatic generation of FSM

Latency

Area

56

RTL synthesis is mature and synthesis from RTL descriptions is the most popular means of integrated circuit design

Behavioral synthesis is much less mature– Initial use in DSP design (constrained problems:

such as register allocation given a schedule)– Growing use in video (lots of arithmetic),

networking (complex schedules), and gradually, general ASIC design

Summary

Page 29: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

29

244 Lecture Notes 11-1

57

SYNTHESIS REFERENCES

Overviews:– Giovanni De Micheli,Synthesis and Optimization of Digital

Circuits, McGraw-Hill 1994 – S. Devadas, A. Ghosh, K. Keutzer, Logic Synthesis, McGraw-

Hill 1994

Particularly for DSP:– S. Note et al. ``Combined Hardware Selection and Pipelining in

High-Level Performance Data-Path Design,’’ IEEE Transactions on CAD/ICAS, Vol. CAD 11, pp. 413-423, April 1992.

– J. Rabaey et al. ``Cathedral II: A Synthesis System for Multiprocessor DSP Systems,’’ in Silicon Compilation, Edited by Dan Gajski, Addison-Wesley 1988.

58

Extras

Page 30: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

30

244 Lecture Notes 11-1

59

Manipulating and optimizing RTL descriptions while maintaining required functionality but changing FSM behavior

Behavioral transformations

– Parallelize, serialize or pipeline a behavioral description

– Allocate hardware; registers buses, ALUs to obtain a RTL description with a definite structure

Behavioral Synthesis

60

Elliptic-filter (In, Out)

{i = Ina = i + t2b = a + t13g = t33 + t39e = g + t26 + b

d = (m21 * e) + bf = (m24 * e) + g

… …

Out = o

}

Behavioral Specification

Page 31: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

31

244 Lecture Notes 11-1

61

Treat the given description as a “dataflow” type specification.

Synthesize a datapath by first constructing a schedule of operations, that will require a certain amount of hardware, and then synthesize a controller given the datapathand associated schedule.

Control is viewed totally as a by-product ofdatapath synthesis.

Hardware Allocation

62

V3 = V1 + V2V5 = V3 - V4V7 = V3 * V6V8 = V3 + V5V9 = V1 + V7V11 = V10 / V5V13 = V3V12 = V1V14 = V11 and V8V15 = V12 or V9V1 = V14V2 = V15

Example Behavior

Page 32: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

32

244 Lecture Notes 11-1

63

A schedule:

r3 = r1 + r2 r8 = r1r7 = r3 - r4r2 = r3 * r5r3 = r3 + r7r2 = r1 + r2r7 = r6 / r7r1 = r3 and r7r2 = r2 or r8

V1 to V15 merged into r1 to r8

Schedule - I

64

ALUr1 r2 r3 r4 r5 r6 r7 r8

1

2 3

4

5 6 7

8

9 10 11

12

13 14

15 16

r3 = r1 + r2 r8 = r1r7 = r3 - r4r2 = r3 * r5r3 = r3 + r7r2 = r1 + r2r7 = r6 / r7r1 = r3 and r7r2 = r2 or r8

Datapath - I

ALU performs +, -, *, /, and, or

Page 33: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

33

244 Lecture Notes 11-1

65

Controller - I

Controller:

0 S0 S0 N0P 0000000000000000

1 S0 S1 N0P 0000000000000000

- S1 S2 ADD 1111010010000010

- S2 S3 SUB 1110000101000100

- S3 S4 MUL …

- S4 S5 ADD …

- S5 S6 ADD

- S6 S7 DIV

- S7 S8 AND

- S8 S0 OR …

66

r3 = r1 + r2 r8 = r1r5 = r3 - r4 r2 = r3 * r6r3 = r3 + r5 r5 = r5 * r7r2 = r1 and r2 r1 = r3 or r5r2 = r8 + r2

ALU1

1 2

3

and+-

4

6

5

ordiv*

r1 r2 r3 r4 r5 r6 r7 r8

7 8

9

10 11

15

13

14

12

16

17 18 19

20

21

2223

Schedule and Datapath - II

ALU2

Page 34: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

34

244 Lecture Notes 11-1

67

Controller

0 S0 S0 NOP NOP 000000000000000000000001 S0 S1 NOP NOP 00000000000000000000000- S1 S2 ADD NOP 11100001000100100000010- S2 S3 SUB MUL … - S3 S4 ADD DIV … - S4 S5 AND OR- S5 S6 ADD NOP …

Controller - II

68

Arithmetic unit allocation: Decide the number of arithmetic units, as well as the grouping of arithmetic operators

e.g., ALU1: + , -ALU2: + , -, incrALU3: * , shift

or ALU1: + , - , incr , shiftALU2: *

Register allocation: Coalesce the variables into a minimum number of registers

Interconnect allocation: Allocate buses and links, to implement all the required data transfers between the registers and ALUs.

Allocation Problems

Page 35: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

35

244 Lecture Notes 11-1

69

Call the X dimension (hardware) space and the Y dimension (execution) time

Minimum number of ALUs:– Maximum number of occupied space slots.

Each ALU performs operations in its slot.

Minimum number of registers:– Maximum density of variable lifetimes across

all the time slots.

Interconnections related to the number of parallel data transfers in each time slot.

SpaceTime

Z1 = X1 + Y1K1 = Z1 - X2L1 = Z2 or Z3

Z2 = X1 * Y2K2 = Z2 @ X1

Z3 = X2

K3 = K2 + 1

Placement Formulation

70

Given a code sequence, coalesce the variables into a minimum number of registers.

Define the lifetime of a variable as the interval between its first assignment and last use.

Merge variables with non-overlapping lifetimes

Lifetimes are open intervals, at bottom end.

Register Allocation

V1 = V2 + V3

V4 = V2 - V3

V5 = V1 * V2

V6 = V4 and V3

V7 = V5 or V6

V1 V2 V3 V4 V5 V6 V7

Page 36: RTL and Behavioral Synthesis - users.encs.concordia.causers.encs.concordia.ca/~tahar/coen6551/notes/synth-notes.pdf · 1 244 Lecture Notes 11-1 1 RTL and Behavioral Synthesis Srinivas

36

244 Lecture Notes 11-1

71

Maximum density of variable lifetimes across all timeslots = minimum # registers required

Merge variables into # registers = maximum density

Different schedules have different max. densities.

R1 = R2 + R3R4 = R2 - R3R1 = R1 * R2R4 = R4 and R3R4 = R1 or R4

(V1, V5) R1

(V4, V6, V7) R4

Register Allocation - II

V1 = V2 + V3

V4 = V2 - V3

V5 = V1 * V2

V6 = V4 and V3

V7 = V5 or V6

V1 V2 V3 V4 V5 V6 V7

V2 R2

V3 R3