rtl and behavioral synthesis -...
TRANSCRIPT
1
244 Lecture Notes 11-1
1
RTL and BehavioralSynthesis
Srinivas Devadas
MIT
Kurt Keutzer
Richard Newton
UC Berkeley
2
An RTL description is always implicitly structural -
• the registers and their interconnectivity are defined
• Thus the clock-to-clock behavior is defined
• Only the control logic for the transfers is synthesized.
This approach can be enhanced:
• Register inferencing
• Automating resource allocation
+
+1
*
RTL implies
b
c
a
d
e2
RTL level
a = b + c;d = a + 1;e = d * 2;
Behavioral implies
e = 2 * (b + c + 1)
2
244 Lecture Notes 11-1
3
An behavioral description is always functional
• Temporal relationships are only expressed as precedences
• An entire architecture is synthesized from the behavioral description
The two key elements of behavioral :
• Automating resource allocation (could be RTL also)
• Schedule
+
+1
*
RTL implies
b
c
a
d
e2
Behavioral level
a = b + c;d = a + 1;e = d * 2;
Behavioral implies
e = 2 * (b + c + 1)
4
Two Steps:
1. Translation from RTL description into gates
2. Optimization of logic
Ideally, all of the intelligence should be in the optimization step
Unfortunately, starting point for optimization affects results
⇒ Need to write a “good” RTL description
RTL Synthesis
3
244 Lecture Notes 11-1
5
entity VHDL isport C (
A, B, C : in BIT ;Z : out BIT ;
) ;end VHDL ;
architecture VHDL_1 of VHDL isbegin
Z ⇐ (A and B) or C ; }end VHDL_1 ;
entity and port declarationsdefine interface
definesimplementation
logic expression
AB
C
VHDL
Z
6
process beginwait until CLOCK’ event and CLOCK = ‘1’ ;if (ENABLE = ‘1’) then
TOGGLE = not TOGGLE;end if;
end process;
wait infers a flip-flop
>
TOGGLEENABLE
CLOCK
Wait Statements
4
244 Lecture Notes 11-1
7
Current scenario: Designer iterates over RTL descriptions using simulation and synthesis tools to verify and evaluate design
PROBLEM: Equivalent RTL descriptions may result in dramatically different logic circuits in terms of area/performance EVEN AFTER OPTIMIZATION.
Writing RTL Descriptions
8
push multiplexors to inputs
_
+
a
b
SUB
result1
0
result
a
b
-b
SUB
Equivalent RTL Descriptions
if (SUB)result = a - b;
elseresult = a + b;
if (SUB)b = -b;
result = a + b;
+1
0
5
244 Lecture Notes 11-1
9
if (A) then {if (C) then
…}if (A) then { avoid
if (C) then { unnecessary} nested if statements
… }
AC
CA
False Paths
1
0 1
0
10
1. Increase power of optimization.
2. Define “good” input and restrict oneselfto writing “good” input descriptions.
3. Manipulate input descriptions to make them “good”.
Rules for writing descriptions based onknowledge that a library of descriptions are good starting points for synthesis.
Practical Solution
6
244 Lecture Notes 11-1
11
Languages versus Models
⊕
D D
D
D D
D
⊕ ⊕
⊕⊕ ⊕
⊕
⊕⊕ ⊕
⊕⊕
⊕
⊕ ⊕
⊕⊕ ⊕
⊕⊕ ⊕
⊕⊕
⊕
⊕ ⊕⊗
⊗ ⊗ ⊗ ⊗
⊗
⊗ ⊗
D
Esterel MatlabVHDLUML“C, C++”
DiscreteEvent
SynchronousReactive
“FSMs” C forRTOS
FPGA-Based“Von Neumann”
ProcessorHard-Wired
ASIC RTOS
Language
Model ofComputation
ImplementationMedia
12
Languages versus Models
⊕
D D
D
D D
D
⊕ ⊕
⊕⊕ ⊕
⊕
⊕⊕ ⊕
⊕⊕
⊕
⊕ ⊕
⊕⊕ ⊕
⊕⊕ ⊕
⊕⊕
⊕
⊕ ⊕⊗
⊗ ⊗ ⊗ ⊗
⊗
⊗ ⊗
D
Esterel MatlabVHDLUML“C, C++”
DiscreteEvent
SynchronousReactive
“FSMs” C forRTOS
FPGA-Based“Von Neumann”
ProcessorHard-Wired
ASIC RTOS
7
244 Lecture Notes 11-1
13
Languages versus Models
⊕
D D
D
D D
D
⊕ ⊕
⊕⊕ ⊕
⊕
⊕⊕ ⊕
⊕⊕
⊕
⊕ ⊕
⊕⊕ ⊕
⊕⊕ ⊕
⊕⊕
⊕
⊕ ⊕⊗
⊗ ⊗ ⊗ ⊗
⊗
⊗ ⊗
D
Esterel MatlabVHDLUML“C, C++”
SynchronousReactive
FPGA-Based“Von Neumann”
ProcessorHard-Wired
ASIC RTOS
SLSMSLSMMLSMMLSM
(really just syntax)
14
Languages versus Models:A Software Analogy
f77
"C"
Lisp
Pascal
Von Neumann Computer
Intermediate Form
⊕
D DD
D D
D
⊕ ⊕
⊕⊕ ⊕
⊕
⊕⊕ ⊕
⊕⊕
⊕
⊕ ⊕
⊕⊕ ⊕
⊕⊕ ⊕
⊕⊕
⊕
⊕ ⊕⊗
⊗ ⊗ ⊗ ⊗
⊗
⊗ ⊗
D
UDL-I VHDL
Silage
Hardware Implementation
Hardware Intermediate Form
8
244 Lecture Notes 11-1
15
Behavioral Synthesis Features
Scheduling of operations– Operation Chaining and Multi-cycling– Different Scheduling Modes, and constraints
Memory Inferencing– Array’s can be mapped directly to RAM– Automatic control/data/address generation– RAM tradeoffs (1 port vs. 2 port) are easy
Using Pipelined Components– Multipliers, Adders, RAMs, others
Pipelining Loop with various Throughput & Latency
Allocation of resources– Building your own design components
Automatic generation of FSM Controller and integration of Datapath
16
What is an Architecture?
m Amdahl, Blaauw, and Brooks, 1964, defined three interfaces:
Computer Architecture: "The attributes of a computer as seen by a machine language programmer."
Implementation:"Actual hardware structure, logic design, anddatapath organization."
Realization: "Encompasses the logic technologies, packaging, and interconnection."
9
244 Lecture Notes 11-1
17
What is an Architecture?
m A description of the behavior of a system that is independent of its implementation.
Ô Isomorphic to its "interface specification" (Siewiorek, Bell, Newell, 1971)
For example:Ô Instruction set definition of a computerÔ Z-domain description of a filterÔ Handshaking protocol for a bus
m May guide implementation (contain 'hints' or 'pragmas')e.g. a particular specification may lead naturally to a serial or parallel
implementation.
18
What is an Architecture?
Example: Instruction set
Inst_A Inst_B Inst_C ... Inst_C may not follow Inst_A
Inst_A Inst_B Inst_C ... Inst_C may not follow Inst_A because Inst_A uses a scratchpad register that Inst_C will over-write.
Architecture Not architecture
10
244 Lecture Notes 11-1
19
Specification vs. DescriptionSpecification: Saying what I want; describes behavior in terms of
results.e.g. ∀A { A[i,j] ← 0}
Description: Saying how to do it; describes behavior in terms of procedure or process.e.g. for(i=0; i<N; i++)
for(j=0; j<M; j++)A[i][j] = 0;
We do not have specification languages for general-purpose digital design. For some special-purpose applications (e.g. DSP) we do.
20
Level of Acceptance of SynthesisTechniques
Source: A. DeGeus
S pe e d o f De s igne r
Expe rt de s igne r
Behavio ral Synthe s is
Ac ce ptanc e c urve
Quality o f De s ign
Sequentia l Synthes is
Co mbinatio nal Sy nthe s is
11
244 Lecture Notes 11-1
21
Tradeoff Example - Echo Canceler
22
Echo Canceler - Tap
Total:6 Taps
12
244 Lecture Notes 11-1
23
Tap - Complex Multiplication
16x2->18multiplication
17x17->17addition
17x17->17subtraction
Total: 6 complex multiplications
24
Echo Canceler Complexity
Clock cycle : 30ns
Complexity : around 10K to 20K gates
Modules : 16x2 multiplier : 22ns 713 gates
17x17 add/sub : 20ns 193 gates
Source : ~2000 VHDL excl. comments
operations handled by BC operations handled by DC
addition : 30 constant : 11subtraction : 20 logic oper. : 66multiplication : 24 word expand : 100compare oper. : 2 truncation : 124data select : 29 shift oper. : 62delay elements : 24Total : 492
13
244 Lecture Notes 11-1
25
Echo Canceler Source Codeuse work.COSSAP_PACKAGE_SYNOPSYS.all ;
. . . . . . . . . port ( IN_R : in STD_LOGIC_VECTOR((WORDLENGTHIN -1) downto 0) ;. . . . . . . . . . . . . .
architecture behavior of ec is. . . . . . . . . . . .ot := (in2 and mask) or (in1 and not mask) ;delay_line_M_6_7_7 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_7_8 := (others => (others => '0'));delay_line_M_6_7_5 := (others => (others => '0'));delay_line_M_6_7_12 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_6_7 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_6_8 := (others => (others => '0'));delay_line_M_6_6_5 := (others => (others => '0'));
delay_line_M_6_11_12 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_14_7 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));delay_line_M_6_14_8 := (others => (others => '0'));delay_line_M_6_14_5 := (others => (others => '0'));delay_line_M_6_14_12 := (others => STD_LOGIC_VECTOR(conv_signed(0,16)));wait untilCLOCK'event and CLOCK='1';
M_6_7_11_11_SIG_0010 := SIGNED(M_6_7_11_11_SIG_0008) + SIGNED(M_6_7_11_11_SIG_0009);
M_6_7_11_12_SIG_0010 := SIGNED(M_6_7_11_12_SIG_0009) -SIGNED(M_6_7_11_12_SIG_0008);
M_6_7_9_11_SIG_0010 := SIGNED(M_6_7_9_11_SIG_0008) + SIGNED(M_6_7_9_11_SIG_0009);
M_6_7_9_12_SIG_0010 := SIGNED(M_6_7_9_12_SIG_0009) -SIGNED(M_6_7_9_12_SIG_0008);
M_6_7_6_SIG_0006 := signed(M_6_7_6_SIG_0017) * signed(M_6_7_6_SIG_0024);M_6_7_6_SIG_0009 := signed(M_6_7_6_SIG_0022) * signed(M_6_7_6_SIG_0019);M_6_7_6_SIG_0016 := signed(M_6_7_6_SIG_0018_1) * signed(M_6_7_6_SIG_0020);M_6_7_6_SIG_0013 := signed(M_6_7_6_SIG_0021) * signed(M_6_7_6_SIG_0023);
delay_line_M_6_6_5 := M_6_6_SIG_0034 & delay_line_M_6_6_5(1 to 1 -1);delay_line_M_6_6_5(1) := M_6_6_SIG_0034; delay_line_M_6_6_8 := M_6_6_SIG_0025 & delay_line_M_6_6_8(1 to 1 -1);delay_line_M_6_6_8(1) := M_6_6_SIG_0025;
M_6_6_11_11_SIG_0010 := SIGNED(M_6_6_11_11_SIG_0008) + SIGNED(M_6_6_11_11_SIG_0009);
M_6_6_11_12_SIG_0010 := SIGNED(M_6_6_11_12_SIG_0009) -SIGNED(M_6_6_11_12_SIG_0008);
M_6_6_9_11_SIG_0010 := SIGNED(M_6_6_9_11_SIG_0008) + SIGNED(M_6_6_9_11_SIG_0009);
M_6_6_9_12_SIG_0010 := SIGNED(M_6_6_9_12_SIG_0009) -SIGNED(M_6_6_9_12_SIG_0008);
M_6_6_6_SIG_0006 := signed(M_6_6_6_SIG_0017) * signed(M_6_6_6_SIG_0024);M_6_6_6_SIG_0009 := signed(M_6_6_6_SIG_0022) * signed(M_6_6_6_SIG_0019);M_6_6_6_SIG_0016 := signed(M_6_6_6_SIG_0018_1) * signed(M_6_6_6_SIG_0020);
M_6_6_11_11_SIG_0010 := SIGNED(M_6_6_11_11_SIG_0008) + SIGNED(M_6_6_11_11_SIG_0009);
M_6_6_11_12_SIG_0010 := SIGNED(M_6_6_11_12_SIG_0009) -SIGNED(M_6_6_11_12_SIG_0008);
M_6_6_9_11_SIG_0010 := SIGNED(M_6_6_9_11_SIG_0008) + SIGNED(M_6_6_9_11_SIG_0009);
M_6_6_9_12_SIG_0010 := SIGNED(M_6_6_9_12_SIG_0009) -SIGNED(M_6_6_9_12_SIG_0008);
M_6_6_6_SIG_0006 := signed(M_6_6_6_SIG_0017) * signed(M_6_6_6_SIG_0024);
M_6_6_6_SIG_0009 := signed(M_6_6_6_SIG_0022) * signed(M_6_6_6_SIG_0019);
M_6_6_6_SIG_0016 := signed(M_6_6_6_SIG_0018_1) * signed(M_6_6_6_SIG_0020);
M_6_6_6_SIG_0013 := signed(M_6_6_6_SIG_0021) * signed(M_6_6_6_SIG_0023);
delay_line_M_6_5_5 := M_6_5_SIG_0034 & delay_line_M_6_5_5(1 to 1 -1);delay_line_M_6_5_5(1) := M_6_5_SIG_0034; delay_line_M_6_5_8 := M_6_5_SIG_0025 & delay_line_M_6_5_8(1 to 1 -1);delay_line_M_6_5_8(1) := M_6_5_SIG_0025;
M_6_5_11_11_SIG_0010 := SIGNED(M_6_5_11_11_SIG_0008) + SIGNED(M_6_5_11_11_SIG_0009);
M_6_5_11_12_SIG_0010 := SIGNED(M_6_5_11_12_SIG_0009) -SIGNED(M_6_5_11_12_SIG_0008);
M_6_5_9_11_SIG_0010 := SIGNED(M_6_5_9_11_SIG_0008) + SIGNED(M_6_5_9_11_SIG_0009);
M_6_5_9_12_SIG_0010 := SIGNED(M_6_5_9_12_SIG_0009) -SIGNED(M_6_5_9_12_SIG_0008);
M_6_5_6_SIG_0006 := signed(M_6_5_6_SIG_0017) * signed(M_6_5_6_SIG_0024);
M_6_5_6_SIG_0009 := signed(M_6_5_6_SIG_0022) * signed(M_6_5_6_SIG_0019);
M_6_5_6_SIG_0016 := signed(M_6_5_6_SIG_0018_1) * signed(M_6_5_6_SIG_0020);
M_6_5_6_SIG_0013 := signed(M_6_5_6_SIG_0021) * signed(M_6_5_6_SIG_0023);
delay_line_M_6_4_5 := M_6_4_SIG_0034 & delay_line_M_6_4_5(1 to 1 -1);delay_line_M_6_4_5(1) := M_6_4_SIG_0034; delay_line_M_6_4_8 := M_6_4_SIG_0025 & delay_line_M_6_4_8(1 to 1 -1);delay_line_M_6_4_8(1) := M_6_4_SIG_0025;
M_6_4_11_11_SIG_0010 := SIGNED(M_6_4_11_11_SIG_0008) + SIGNED(M_6_4_11_11_SIG_0009);
M_6_4_11_12_SIG_0010 := SIGNED(M_6_4_11_12_SIG_0009) -SIGNED(M_6_4_11_12_SIG_0008);
M_6_4_9_11_SIG_0010 := SIGNED(M_6_4_9_11_SIG_0008) + SIGNED(M_6_4_9_11_SIG_0009);
M_6_4_9_12_SIG_0010 := SIGNED(M_6_4_9_12_SIG_0009) -SIGNED(M_6_4_9_12_SIG_0008);
M_6_4_6_SIG_0006 := signed(M_6_4_6_SIG_0017) * signed(M_6_4_6_SIG_0024);
M_6_4_6_SIG_0009 := signed(M_6_4_6_SIG_0022) * signed(M_6_4_6_SIG_0019);
M_6_4_6_SIG_0016 := signed(M_6_4_6_SIG_0018_1) * signed(M_6_4_6_SIG_0020);
26
Scheduling Tradeoffs I
* *
<
Design #2 Design #3
Design Tradeoff with Scheduling
**
+
<
+ **
<+
**
<+
Design #1
clock 50ns2 fast multipliers
3 cycles
clock 50ns1 fast multipliers
4 cycles
clock 50ns2 slow multipliers
4 cycles
multi-cycling
14
244 Lecture Notes 11-1
27
Scheduling Tradeoffs II
Design #5
Tradeoff:
Clock speed vs Latency vs Resources
*
*
<+
clock 50ns1 pipeline multiplier
5 cycles
Design #4
**
<+
clock 100ns2 slow multipliers
2 cycles
chaining
2-stagepipelinedmultiplier
28
Scheduling Tradeoff Results
Tradeoff: Resources, Area &Number of cycles
# multiplier# adderTotal Area
4 8 16 24 32 40 48 # cycles
Area
10
42 2 1 1 1
15
7
4 52 2 1
15
244 Lecture Notes 11-1
29
Scheduling Modes
Cycle-Fixed Mode– In this mode, the exact cycle times of read and write
operations on signals are preserved. Operations on variables are free to move around.
Superstate-Fixed Mode– In this mode, the relative times of read and write
operations are preserved, but superstates(delimited by wait statements) may be stretched. Operations on variables are free to move around.
Free-floating Mode– In this mode, computations, read and write operations
may flow freely past wait statements
In all cases, user can provide constraints to control the scheduling process.
30
Approaches to Scheduling
(1) Exhaustive ApproachesÔ e.g. EXPL (Barbacci , 1962): exhaustive search, Hafer & Parker: Branch & bound
(2) As-Soon-As-Possible (ASAP)Ô Used by many pioneering systems (e.g. early CMUDA, MIMOLA, and Flamel )
16
244 Lecture Notes 11-1
31
As-Soon-As-Possible (ASAP) andAs-Late-As-Possible (ALAP) Scheduling
⊕⊗
⊗⊗ ⊗ ⊕
⊗ <
1
2
3
4
⊗
⊕
⊗⊗⊗
⊗⊕⊗
<
⊗
ASAP ALAP
32
Approaches to Scheduling(3) List Scheduling
Ô Keep list of operations available to be scheduled (i.e. whose predecessors have already been scheduled) and order them for scheduling using some priority function
• BUD - length of path from operation to end of enclosing block• Elf, ISYN - "urgency": length of shortest path from operation to
nearest local constraintÔ O(C NlogN) time complexity
(4) Force-Directed ApproachesÔ "Force" between operation and control-step proportional to the number of operations
of the same type that could go in that control stepÔ Scheduling to minimize force tends to minimize the number of resources needed
(e.g. HAL)
Ô O(C N^2) time complexity
17
244 Lecture Notes 11-1
33
Minimizing Resources UsingForce-Directed Techniques
⊕
⊗⊗⊗
⊗⊕
⊗<
1
2
3
4
⊗⊕
⊗⊗⊗
⊗⊕
⊗<
×± L
21 -
2- 1
21 -
- 2 -
×± L
21 -
21 -
11 -
- 1 1
34
Basic Force-Directed Algorithm
repeat {
(1) evaluate ASAP and ALAP limits for each operation;(2) create or update distribution graphs;(3) calculate the self-force for every unscheduled
operation and feasible control step;(4) add predecessor and successor forces to self-force;(5) schedule the operation with lowest force;
} until ( all operations are scheduled;)
18
244 Lecture Notes 11-1
35
5th-Order Wave-Digital Filter
⊕
D D
D
D D
D
⊕ ⊕
⊕⊕ ⊕
⊕
⊕⊕ ⊕
⊕⊕
⊕
⊕ ⊕
⊕⊕ ⊕
⊕⊕ ⊕
⊕⊕
⊕
⊕ ⊕⊗
⊗ ⊗ ⊗ ⊗
⊗
⊗ ⊗
D
m 26 additions and 8 multiplies, as specified
36
5th Order Wave-Digital Filter Dataflow Graph
++
+
+
+
++
++
+
++
++
++
+
++
++
+
+
++
+
××
××
×
××
×
12111091 2 3 4 5 6 7 8 13 14 15 16 17 18
19
244 Lecture Notes 11-1
37
Results from List Scheduling
+++
+
+++
12
+++ +
+ ++
13
14
15
16
17
18
19
++
++
+++ +
++ +
+
× ×
× ×
××
×
×
Case 1 Case 2
38
ASAP / ALAP Schedules for Case 1
+++ ++
12
+++ +
+
+
+
+
++
13
14
15
16
17
13
14
15
16
17
DG: addDG: mult
1 2 3 4
Resources: add, mult
× × × ××
20
244 Lecture Notes 11-1
39
ASAP / ALAP Schedules for Case 2
+
+
++
12
13
14
15
16
17
18
+++
++
+
×++
×+
++
+
+
+
×++
13
14
15
16
17
18
DG: addDG: multResources: add, mult
1 2 3 4
×× ×
×
40
Representative Results for 5th Order WDF
Synthesis System Adders Multplrs Cycles
FDS(HAL) 2 2 19
APARTY(SAW), CMU 2 2 19
CADDY 2 2 18
FDLS(HAL) 2 2 18
<ESC> 2 2 18
HAL, CADDY, Nische 2 1p 19
SPAID, <ESC>
21
244 Lecture Notes 11-1
41
Memory Types, Embedded
Memory Types:
• Synchronous : Inputs & outputs occur at clock edge
• Asynchronous : Unclocked signals
• Single Port: Either read or write to one address location
• Dual/Multi Port: Two fully independent read/write ports
DualPort
Data In A
Data In B
Data Out A
Data Out B
Address A Address B
Enable A
SinglePort
Data In Data Out
Address
Enable Enable B
42
Memory Inferencing------ RAM Declaration Section -------type word is integer range 0 to 255;variable a : array N downto 1 of word;variable b : array M downto N + 1 of word;constant MY_RAM : resource := 0;attribute variables of MY_RAM is “a b”;attribute map_to_module of MY_RAM is “DW03_ram1_s_d”;
---- RAM Read and Write Example -------a(i) := x * y; -- writeSUM := a(i) + b(N+i); -- read
/***** RAM Declaration Section *****/reg [7:0] a[N:0], b[M:N+1];/* synopsys resource MY_RAM: variables = “a b”,
map_to_module = “DW03_ram1_s_d”; */
/***** RAM Read and Write Example *****/a[i] = x * y;SUM = a[i] + b[N+i];
Support for Single Port, Dual Port, Etc.
Several arrays can be mapped to the same RAM
22
244 Lecture Notes 11-1
43
Memory Tradeoff Example/***** RAM Declaration Section *****/reg [7:0] p[0:63], q[0:63];
/* synopsys resource MY_RAM: variables = “p”, map_to_module =
“DW03_ram1_s_d”; *//* synopsys resource MY_RAM: variables = “q”,
map_to_module = “DW03_ram1_s_d”; */. . . . . . . .
begin : countingforever begin/***** RAM Read and Write Example *****/incm_p : for (i=0; i<63; i=i-1) begin
q[i] = p[i] + 1; /* This will be unrolled */end
end end // counting
DualPort
Data In A
Data In B
Data Out A
Data Out B
Address A Address B
Enable A
SinglePort
Data In Data Out
Address
Enable Enable B
44
Memory Tradeoff Results
Tradeoff: RAM read/write ports, Area & Latency
1 2 3 4
# port in RAM
Latency
read 1cwrite 2c
read 1cwrite 2c
read 1cwrite 1c
read 1cwrite 1c
latency132
latency131
latency67
latency67
# R/W cycles
23
244 Lecture Notes 11-1
45
Loop PipeliningFor high-throughput designs
– Pipelined to start every two cycles, Operations overlapped.
– Latency remains the same. Throughput increases and input/output streams become more regular.
i +
1
j+
+
A(i) B(i)
Z(i)A(j) B(j)
Z(j)
i +
1
j+
+
A(i) B(i)
Z(i)
A(j) B(j)
Z(j)
i +
1
j
With Loop Pipelining…
…Significantly Higher Throughput
Without Loop Pipelining
46
Loop Pipelining Tradeoffs
**
<+
Without Loop pipelining:
Latency 4 cyclesThroughput 4 cycles
1 fast multiplier
**
<+ *
*
<+
**
<+
**
<+
Latency 4 cyclesThroughput 1 cycles
2 fast multipliers
Latency 4 cyclesThroughput 2 cycles
1 fast multiplier
24
244 Lecture Notes 11-1
47
Loop Pipelining Tradeoffs ResultsTradeoff: Resources, Initiation, Throughput & Area
# cycles
# multiplier# adderTotal Area
Area
4
910
16
throughputlatency
4 cycles8 cycles
8 cycles16 cycles
3
7
12 cycles24 cycles
16 cycles32 cycles
2
6
48
Pipelined Components• If faster multiplier required, use
pipelined multipliers from design library- Two-stage- Three-stage
• Benefit of using design library- Behavioral synthesisautomaticallygenerates control logic- Better area efficiency- Resource sharing
Write SourceCode
Check Timing(DC Output)
If Multiplier onCritical Path
• Pipeline Multiplier
• Modify Source Codeto Add Pipeline (States)
Change y = a * b toy = mult_2s(a,b,clk)
RTL synthesis
BehavioralSynthesis
Data 1
Data 1 Result 1 Data 2 Result 2
Clock
Multiplier
New Clock
PipelineMultiplier
P Res 1 Res 1Data 2
Data 3P Res 2 Res 2
Res 3P Res 3
25
244 Lecture Notes 11-1
49
Pipelined Tradeoff Results
Tradeoff: Resources, Area & Number of cycles
# pipeline multiplier
# adderTotal Area
4 8 16 24 32 40 48 # cycles
Area
12
42 2 1 1 1
16
10
7 64 3 5
50
Creating customized design library parts
Incremental Step for defining design library components for Behavioral Synthesis
Source Design(.vhd, .v, .db)
Synthesis Directives(loads, drives, etc)
designlibrary
developer
Semiconductor Vendor
Design LibraryComponents
Your CustomDesign LibraryComponents
SynopsysDesign LibraryComponents
Design LibraryInput
Library (.sl)
26
244 Lecture Notes 11-1
51
Identify the common expression
**+
+
**+
**+
**+
+ +
Implemented as design library component
function [15:0] sum_of_product;// map_to_operator dw_sop// return_port_name szinput [7:0] a, b, c, d;
beginsum_of_product = a * b + c * d;
end
**+
52
Creating the implementation
function sum_of_product (a, ..: input UNSIGNED)return UNSIGNED is
-- pragma map_to_operator dw_sop-- pragma return_port_name sz
begin........end;
Function Definition
Synthetic Operator
Synthetic Module
Implementations
dw_sop
Small Fast
dw_sop_mod .sl file
a
b*
c
d*
sz+
27
244 Lecture Notes 11-1
53
Other Features included
Bit-level timing of design library components– Does the operation fit within a clock?
– Do multiple operations fit within a clock?
When should the operations be performed– 3rd, 4th cycle?– How many operations in each clock cycle?
Register assignment– Where should temporary values be stored?– How long are the inputs, temporary values needed?
Is the mux cost more expensive than sharing a resource?
54
Summary of all Tradeoff Results
18 different design tradeoffs
in 3 days
4 8 16 24 32 40 48 Throughput
Area
with pipeline multiplier
with loop pipeliningwith different schedule
28
244 Lecture Notes 11-1
55
Behavioral Design BenefitsBenefits:
– more abstraction• specifying functionality instead of
implementation
– faster simulation– system level design
• latency, #cycles, #resources, clock period
• throughput and area correlation– better quality of result
• multi-cycle optimization
• optimization not limited by register boundaries
– automatic generation of FSM
Latency
Area
56
RTL synthesis is mature and synthesis from RTL descriptions is the most popular means of integrated circuit design
Behavioral synthesis is much less mature– Initial use in DSP design (constrained problems:
such as register allocation given a schedule)– Growing use in video (lots of arithmetic),
networking (complex schedules), and gradually, general ASIC design
Summary
29
244 Lecture Notes 11-1
57
SYNTHESIS REFERENCES
Overviews:– Giovanni De Micheli,Synthesis and Optimization of Digital
Circuits, McGraw-Hill 1994 – S. Devadas, A. Ghosh, K. Keutzer, Logic Synthesis, McGraw-
Hill 1994
Particularly for DSP:– S. Note et al. ``Combined Hardware Selection and Pipelining in
High-Level Performance Data-Path Design,’’ IEEE Transactions on CAD/ICAS, Vol. CAD 11, pp. 413-423, April 1992.
– J. Rabaey et al. ``Cathedral II: A Synthesis System for Multiprocessor DSP Systems,’’ in Silicon Compilation, Edited by Dan Gajski, Addison-Wesley 1988.
58
Extras
30
244 Lecture Notes 11-1
59
Manipulating and optimizing RTL descriptions while maintaining required functionality but changing FSM behavior
Behavioral transformations
– Parallelize, serialize or pipeline a behavioral description
– Allocate hardware; registers buses, ALUs to obtain a RTL description with a definite structure
Behavioral Synthesis
60
Elliptic-filter (In, Out)
{i = Ina = i + t2b = a + t13g = t33 + t39e = g + t26 + b
d = (m21 * e) + bf = (m24 * e) + g
… …
Out = o
}
Behavioral Specification
31
244 Lecture Notes 11-1
61
Treat the given description as a “dataflow” type specification.
Synthesize a datapath by first constructing a schedule of operations, that will require a certain amount of hardware, and then synthesize a controller given the datapathand associated schedule.
Control is viewed totally as a by-product ofdatapath synthesis.
Hardware Allocation
62
V3 = V1 + V2V5 = V3 - V4V7 = V3 * V6V8 = V3 + V5V9 = V1 + V7V11 = V10 / V5V13 = V3V12 = V1V14 = V11 and V8V15 = V12 or V9V1 = V14V2 = V15
Example Behavior
32
244 Lecture Notes 11-1
63
A schedule:
r3 = r1 + r2 r8 = r1r7 = r3 - r4r2 = r3 * r5r3 = r3 + r7r2 = r1 + r2r7 = r6 / r7r1 = r3 and r7r2 = r2 or r8
V1 to V15 merged into r1 to r8
Schedule - I
64
ALUr1 r2 r3 r4 r5 r6 r7 r8
1
2 3
4
5 6 7
8
9 10 11
12
13 14
15 16
r3 = r1 + r2 r8 = r1r7 = r3 - r4r2 = r3 * r5r3 = r3 + r7r2 = r1 + r2r7 = r6 / r7r1 = r3 and r7r2 = r2 or r8
Datapath - I
ALU performs +, -, *, /, and, or
33
244 Lecture Notes 11-1
65
Controller - I
Controller:
0 S0 S0 N0P 0000000000000000
1 S0 S1 N0P 0000000000000000
- S1 S2 ADD 1111010010000010
- S2 S3 SUB 1110000101000100
- S3 S4 MUL …
- S4 S5 ADD …
- S5 S6 ADD
- S6 S7 DIV
- S7 S8 AND
- S8 S0 OR …
66
r3 = r1 + r2 r8 = r1r5 = r3 - r4 r2 = r3 * r6r3 = r3 + r5 r5 = r5 * r7r2 = r1 and r2 r1 = r3 or r5r2 = r8 + r2
ALU1
1 2
3
and+-
4
6
5
ordiv*
r1 r2 r3 r4 r5 r6 r7 r8
7 8
9
10 11
15
13
14
12
16
17 18 19
20
21
2223
Schedule and Datapath - II
ALU2
34
244 Lecture Notes 11-1
67
Controller
0 S0 S0 NOP NOP 000000000000000000000001 S0 S1 NOP NOP 00000000000000000000000- S1 S2 ADD NOP 11100001000100100000010- S2 S3 SUB MUL … - S3 S4 ADD DIV … - S4 S5 AND OR- S5 S6 ADD NOP …
Controller - II
68
Arithmetic unit allocation: Decide the number of arithmetic units, as well as the grouping of arithmetic operators
e.g., ALU1: + , -ALU2: + , -, incrALU3: * , shift
or ALU1: + , - , incr , shiftALU2: *
Register allocation: Coalesce the variables into a minimum number of registers
Interconnect allocation: Allocate buses and links, to implement all the required data transfers between the registers and ALUs.
Allocation Problems
35
244 Lecture Notes 11-1
69
Call the X dimension (hardware) space and the Y dimension (execution) time
Minimum number of ALUs:– Maximum number of occupied space slots.
Each ALU performs operations in its slot.
Minimum number of registers:– Maximum density of variable lifetimes across
all the time slots.
Interconnections related to the number of parallel data transfers in each time slot.
SpaceTime
Z1 = X1 + Y1K1 = Z1 - X2L1 = Z2 or Z3
Z2 = X1 * Y2K2 = Z2 @ X1
Z3 = X2
K3 = K2 + 1
Placement Formulation
70
Given a code sequence, coalesce the variables into a minimum number of registers.
Define the lifetime of a variable as the interval between its first assignment and last use.
Merge variables with non-overlapping lifetimes
Lifetimes are open intervals, at bottom end.
Register Allocation
V1 = V2 + V3
V4 = V2 - V3
V5 = V1 * V2
V6 = V4 and V3
V7 = V5 or V6
V1 V2 V3 V4 V5 V6 V7
36
244 Lecture Notes 11-1
71
Maximum density of variable lifetimes across all timeslots = minimum # registers required
Merge variables into # registers = maximum density
Different schedules have different max. densities.
R1 = R2 + R3R4 = R2 - R3R1 = R1 * R2R4 = R4 and R3R4 = R1 or R4
(V1, V5) R1
(V4, V6, V7) R4
Register Allocation - II
V1 = V2 + V3
V4 = V2 - V3
V5 = V1 * V2
V6 = V4 and V3
V7 = V5 or V6
V1 V2 V3 V4 V5 V6 V7
V2 R2
V3 R3