
E&CE 427: Digital Systems Engineering

Course Notes

Mark Aagaard

2006t3Fall

University of Waterloo

Dept of Electrical and Computer Engineering

September 18, 2006



CONTENTS

P10.7.3 Reality

11 Problems on Faults, Testing, and Testability
  P11.1 Based on Smith q14.9: Testing Cost
  P11.2 Testing Cost and Total Cost
  P11.3 Minimum Number of Faults
  P11.4 Smith q14.10: Fault Collapsing
  P11.5 Mathematical Models and Reality
  P11.6 Undetectable Faults
  P11.7 Test Vector Generation
    P11.7.1 Choice of Test Vectors
    P11.7.2 Number of Test Vectors
  P11.8 Time to do a Scan Test
  P11.9 BIST
    P11.9.1 Characteristic Polynomials
    P11.9.2 Test Generation
    P11.9.3 Signature Analyzer
    P11.9.4 Probability of Catching a Fault
    P11.9.5 Probability of Catching a Fault
    P11.9.6 Detecting a Specific Fault
    P11.9.7 Time to Run Test
  P11.10 Power and BIST
  P11.11 Timing Hazards and Testability
  P11.12 Testing Short Answer
    P11.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?
    P11.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?
  P11.13 Fault Testing
    P11.13.1 Design test generator
    P11.13.2 Design signature analyzer
    P11.13.3 Determine if a fault is detectable
    P11.13.4 Testing time

Part I

Course Notes


Chapter 1

VHDL: The Language

1.1 Introduction to VHDL

1.1.1 Levels of Abstraction

There are many different levels of abstraction for working with hardware:

Quantum: Schrödinger's equations describe the movement of electrons and holes through material.

Energy band: 2-dimensional diagrams that capture essential features of Schrödinger's equations. Energy-band diagrams are commonly used in nano-scale engineering.

Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network. Overall behaviour is defined by differential equations in terms of the resistors and capacitors. Spice is a typical simulation tool.

Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used, rather than differential equations. A rising edge may be modeled as a linear rise over some range of time, or the time between a definite low value and a definite high value may be modeled as having an undefined or rising value.

Gate: Transistors are grouped together into gates (e.g. AND, OR, NOT). Voltages are discrete values such as pure Boolean (0 or 1) or IEEE Standard Logic 1164, which has representations for different types of unknown or undefined values. Time may be continuous or may be discrete. If discrete, a common unit is the delay through a single inverter (e.g. a NOT gate has a delay of 1 and an AND gate has a delay of 2).


numeric_bit defines arithmetic over bit vectors and integers. We won't use bit signals in this course, so you don't need to worry about this package.

1.1.3 Semantics

The original goal of VHDL was to simulate circuits. The semantics of the language define circuit

behaviour.


determine which parts of the library are externally visible

Use clause: use a library in an entity/architecture or another package.

Technically, use clauses are part of entities and packages, but they precede the entity/package keyword, so we list them as top-level constructs.

Entity (section 1.3.3)

define interface to circuit

Architecture (section 1.3.3)

define internal signals and gates of circuit

1.3.3 Entities and Architecture

Each hardware module is described with an Entity/Architecture pair

Figure 1.1: Entity and Architecture

Entity: interface. Names, modes (in / out), and types of the externally visible signals of the circuit.

Architecture: internals. Structure and behaviour of the module.

library ieee;
use ieee.std_logic_1164.all;

entity and_or is
  port (
    a, b, c : in std_logic;
    z : out std_logic
  );
end and_or;

Figure 1.2: Example of an entity


The syntax of VHDL is defined using a variation on Backus-Naur forms (BNF).

[ { use_clause } ]
entity ENTITYID is
  [ port (
      { SIGNALID : (in | out) TYPEID [ := expr ] ; }
    );
  ]
  [ { declaration } ]
[ begin
  { concurrent_statement } ]
end [ entity ] ENTITYID ;

Figure 1.3: Simplified grammar of entity
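As a sketch of how an entity pairs with an architecture, the block below completes the and_or entity of Figure 1.2. The function z = (a AND b) OR c is an assumption made for illustration; the name and_or suggests this behaviour but the notes do not confirm it.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity and_or is
  port (
    a, b, c : in  std_logic;
    z       : out std_logic
  );
end and_or;

architecture main of and_or is
  signal x : std_logic;  -- internal signal, not visible outside the entity
begin
  x <= a and b;          -- assumed function: AND stage
  z <= x or c;           -- assumed function: OR stage
end main;
```

The entity lists only what the outside world can see (a, b, c, z); the internal signal x appears only in the architecture.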

architecture main of and_or is

signal x : std_logic;

begin

x



1.3.4 Concurrent Statements

Architectures contain concurrent statements. Concurrent statements execute in parallel (Figure 1.6).

Concurrent statements make VHDL fundamentally different from most software languages.

Hardware (gates) naturally executes in parallel. VHDL mimics the behaviour of real hardware.

At each infinitesimally small moment of time, each gate:

1. samples its inputs

2. computes the value of its output

3. drives the output
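One consequence of parallel execution is that the textual order of concurrent statements does not matter. The sketch below illustrates this with a hypothetical entity order_demo (the name and the logic function are assumptions, not taken from the notes).

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity order_demo is
  port (
    a, b, c : in  std_logic;
    z       : out std_logic
  );
end order_demo;

architecture main of order_demo is
  signal x : std_logic;
begin
  -- z is written before x in the source text, yet z still depends on x.
  -- Swapping the two lines describes exactly the same circuit.
  z <= x or c;
  x <= a and b;
end main;
```

In a software language the first line would read a stale x; here both assignments behave like gates that are always active.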

architecture main of bowser is

begin

x1



1.3.5 Component Declaration and Instantiations

There are two different syntaxes for component declaration and instantiation. The VHDL-93 syntax is much more concise than the VHDL-87 syntax.

Not all tools support the VHDL-93 syntax. For E&CE 427, some of the tools that we use do not support the VHDL-93 syntax, so we are stuck with the VHDL-87 syntax.

1.3.6 Processes

Processes are used to describe complex and potentially unsynthesizable behaviour.

A process is a concurrent statement (Section 1.3.4).

The body of a process contains sequential statements (Section 1.3.7).

Processes are the most complex and difficult-to-understand part of VHDL (Sections 1.5 and 1.6).

process (a, b, c)

begin

y



1.3.8 A Few More Miscellaneous VHDL Features

Some constructs that are useful and will be described in later chapters and sections:

report : print a message on stderr while simulating

assert : assertions about behaviour of signals, very useful with report statements.

generics : parameters to an entity that are defined at elaboration time.

attributes : predefined functions for different datatypes. For example: high and low indices of a

vector.
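The four constructs can be seen together in one small sketch. The entity misc_demo, the generic width, and the checked condition are all hypothetical and chosen only for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity misc_demo is
  generic (
    width : positive := 8            -- generic: fixed at elaboration time
  );
  port (
    d : in std_logic_vector(width - 1 downto 0)
  );
end misc_demo;

architecture main of misc_demo is
begin
  process (d)
  begin
    -- attributes: high and low indices of the vector d
    report "d spans " & integer'image(d'high) &
           " downto " & integer'image(d'low);
    -- assert: the report message is printed only when the condition is false
    assert d(d'low) /= 'X'
      report "lowest bit of d is unknown"
      severity warning;
  end process;
end main;
```

Because the process contains report and assert statements only, it is meant for simulation rather than synthesis.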

1.4 Concurrent vs Sequential Statements

All concurrent assignments can be translated into sequential statements. But, not all sequential

statements can be translated into concurrent statements.

1.4.1 Concurrent Assignment vs Process

The two code fragments below have identical behaviour:

architecture main of tiny is
begin
  b <= a;
end main;

architecture main of tiny is
begin
  process (a) begin
    b <= a;
  end process;
end main;

t < = ;

when =>

t < = ;

end case;

1.4.4 Coding Style

Code that's easy to write with sequential statements, but difficult with concurrent statements:

Sequential Statements

case is

when =>

if then

o < = ;

else

o < = ;

end if;

when =>

. . .

end case;

Concurrent Statements

Overall structure:with select

t



1.5 Overview of Processes

Processes are the most difficult VHDL construct to understand. This section gives an overview of

processes. Section 1.6 gives the details of the semantics of processes.

Within a process, statements are executed almost sequentially

Among processes, execution is done in parallel

Remember: a process is a concurrent statement!

entity ENTITYID is
  interface declarations
end ENTITYID ;

architecture ARCHID of ENTITYID is
begin
  concurrent statements
  process begin
    sequential statements
  end process;
  concurrent statements
end ARCHID;

Figure 1.11: Sequential statements in a process

Key concepts in VHDL semantics for processes: VHDL mimics hardware

Hardware (gates) execute in parallel

Processes execute in parallel with each other

All possible orders of executing processes must produce the same simulation results (waveforms)

If a signal is not assigned a value, then it holds its previous value

All orders of executing concurrent statements must

produce the same waveforms

It doesn't matter whether you are running on a single-threaded operating system, on a multi-threaded operating system, on a massively parallel supercomputer, or on a special hardware emulator with one FPGA chip per VHDL process: all simulations must be the same.

These concepts are the motivation for the semantics of executing processes in VHDL (Section 1.6)

and lead to the phenomenon of latch-inference (Section 1.5.2).

architecture
  procA: process
    stmtA1;
    stmtA2;
    stmtA3;
  end process;
  procB: process
    stmtB1;
    stmtB2;
  end process;

Execution sequences:
  single threaded, procA before procB: A1, A2, A3, B1, B2
  single threaded, procB before procA: B1, B2, A1, A2, A3
  multithreaded, procA and procB in parallel: A1, A2, A3 alongside B1, B2

Figure 1.12: Different process execution sequences

Figure 1.13: All execution orders must have same behaviour

Sections 1.5.1–1.5.3 discuss the hardware generated by processes.

Sections 1.6–1.6.5 discuss the behaviour and execution of processes.



1.5.1 Combinational Process vs Clocked Process

Each well-written synthesizable process is either combinational or clocked. Some synthesizable processes that do not conform to our coding guidelines are both combinational and clocked. For example, in a flip-flop with an asynchronous reset, the output is a combinational function of the reset signal and a clocked function of the data input signal. We will deal only with processes that follow our coding conventions, and so we will continue to say that each process is either combinational xor clocked.

Combinational process: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Executing the process takes part of one clock cycle

Target signals are outputs of combinational circuitry

A combinational process must have a sensitivity list

A combinational process must not have any wait statements

A combinational process must not have any rising_edge or falling_edge calls

The hardware for a combinational process is just combinational circuitry

Clocked process: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Executing the process takes one (or more) clock cycles

Target signals are outputs of flops

Process contains one or more wait or if rising_edge statements

Hardware contains combinational circuitry and flip-flops

Note: Clocked processes are sometimes called sequential processes, but this can be easily confused with sequential statements, so in E&CE 427 we'll refer to synthesizable processes as either combinational or clocked.
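A minimal sketch contrasting the two kinds of process follows. The entity proc_kinds and its port names are hypothetical; only the rules being illustrated (sensitivity list and no clock edge for combinational, a rising_edge test for clocked) come from the lists above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity proc_kinds is
  port (
    clk, a, b : in  std_logic;
    y_comb    : out std_logic;
    y_reg     : out std_logic
  );
end proc_kinds;

architecture main of proc_kinds is
begin
  -- Combinational: sensitivity list, no wait, no rising_edge.
  comb : process (a, b)
  begin
    y_comb <= a and b;
  end process;

  -- Clocked: the target signal is the output of a flip-flop.
  clocked : process (clk)
  begin
    if rising_edge(clk) then
      y_reg <= a and b;
    end if;
  end process;
end main;
```

Both processes compute the same function; y_comb follows a and b within the same clock cycle, while y_reg changes only after a rising edge of clk.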

Example Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Combinational Process

process (a,b,c)

p1



1.6.2.4 Delta-Cycle Definitions

Definition simulation step: Executing one sequential assignment or process mode change.

Definition simulation cycle: The operations that occur in one iteration of the simulation algorithm.

Definition delta cycle: A simulation cycle that does not advance simulation time. Equivalently: a simulation cycle with zero-delay assignments where the assignment causes a process to resume.

Definition simulation round: A sequence of simulation cycles that all have the same simulation time. Equivalently: a contiguous sequence of zero or more delta cycles followed by a simulation cycle that increments time (i.e., the simulation cycle is not a delta cycle).

Note: Official and unofficial terminology. "Simulation cycle" and "delta cycle" are official definitions in the VHDL Standard. "Simulation step" and "simulation round" are not standard definitions. They are used in E&CE 427 because we need words to associate with the concepts that they describe.


1.6.3 Example 1: Process Execution (Bamboozle)

This example (Bamboozle) and the next example (Flummox, section 1.6.4) are very similar. The

VHDL code for the circuit is slightly different, but the hardware that is generated is the same. The

stimulus for signals a and b also differs.

entity bamboozle is
begin
end bamboozle;

architecture main of bamboozle is
  signal a, b, c, d : std_logic;
begin
  procA : process (a, b) begin
    c <= a AND b;
  end process;

procB : process (b, c, d)

begin

d



Initial conditions (Shown in slides, not in notes)

Step 1(a): Activate procA (Shown in slides, not in notes)

[Figure: waveform for signals a, b, c, d, e, all initially U, next to the code for procA]



Begin next simulation cycle (Shown in slides, not in notes)

Step 1(a): Activate procB (Shown in slides, not in notes)

Step 1(b): Provisional assignment to d (Shown in slides, not in notes)

Step 1(b): Provisional assignment to e (Shown in slides, not in notes)

Step 1(c): Suspend procB (Shown in slides, not in notes)

All processes suspended (Shown in slides, not in notes)


Begin next simulation cycle (Shown in slides, not in notes)

Step 1: No postponed processes (Shown in slides, not in notes)


1.6.4 Example 2: Process Execution (Flummox)

This example is a variation of the Bamboozle example from section 1.6.3.

entity flummox is
begin
end flummox;

architecture main of flummox is
  signal a, b, c, d : std_logic;
begin
  proc1 : process (a, b, c) begin
    c <= a AND b;
    d



Answer:

simulation step, delta cycle, simulation cycle, simulation round

Question: What is the order of granularity, from finest to coarsest, amongst the

different granularities related to delta-cycle simulation?

Answer:

Same order as listed just above. Note: delta cycles have a finer granularity than simulation cycles, because delta cycles do not advance time, while simulation cycles that are not delta cycles do advance time.

1.6.5 Example: Need for Provisional Assignments

This is an example of processes where updating signals during a simulation cycle leads to different

results for different process execution orderings.

architecture main of swindle is
begin
  p_c: process (a, b) begin
    c <= a AND b;
  end process;

  p_d: process (a, c) begin
    d <= a XOR c;
  end process;
end main;

Figure 1.18: Circuit to illustrate need for provisional assignments


1. Start with all signals at 0.

2. Simultaneously change to a = 1 and b = 1.

If assignments are not visible within the same simulation cycle (correct, i.e. provisional assignments are used):

  If p_c is scheduled before p_d, then d will have a 1 pulse.

  If p_d is scheduled before p_c, then d will have a 1 pulse.

If assignments are visible within the same simulation cycle (incorrect):

  If p_c is scheduled before p_d, then d will stay constant 0.

  If p_d is scheduled before p_c, then d will have a 1 pulse.

With provisional assignments, both orders of scheduling processes result in the same behaviour

on all signals. Without provisional assignments, different scheduling orders result in different

behaviour.
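The correct behaviour can be observed in a simulator with a small self-checking testbench. This is a sketch under stated assumptions: the entity swindle_tb, the stim and watch processes, the initial values, and the 10 ns timing are all hypothetical additions; only p_c and p_d come from the swindle architecture.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity swindle_tb is
end swindle_tb;

architecture sim of swindle_tb is
  signal a, b, c, d : std_logic := '0';  -- step 1: all signals start at 0
begin
  p_c : process (a, b) begin
    c <= a and b;
  end process;

  p_d : process (a, c) begin
    d <= a xor c;
  end process;

  -- Step 2: change a and b to 1 in the same simulation cycle.
  stim : process begin
    wait for 10 ns;
    a <= '1';
    b <= '1';
    wait for 10 ns;
    assert d = '0' report "d should settle at 0" severity error;
    assert c = '1' report "c should settle at 1" severity error;
    wait;
  end process;

  -- Reports every change on d, including changes that last one delta cycle.
  watch : process (d) begin
    report "d is " & std_logic'image(d) & " at " & time'image(now);
  end process;
end sim;
```

Apart from the report issued when the watch process first runs at time 0, a standard-conforming simulator should report d going to '1' and then back to '0', both at 10 ns but in different delta cycles. That is the 1 pulse described above, and it does not depend on the order in which p_c and p_d are executed.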



1.6.6 Delta-Cycle Simulations of Flip-Flops

This example illustrates the delta-cycle simulation of a flip-flop. Notice how the delta-cycle simulation captures the expected behaviour of the flip-flop: the signal q changes at the same time (10 ns) as the rising edge on the clock.

p_a : process begin

a



Testbenches and Clock Phases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

env : process begin

a



RTL Simulation Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1. Pre-processing

(a) Separate processes into combinational and non-combinational (clocked and timed)

(b) Decompose each combinational process into separate processes with one target signal

per process

(c) Sort processes into topological order based on dependencies

2. For each clock cycle or unit of time:

(a) Run non-combinational processes in any order. Non-combinational assignments read

from earlier clock cycle / time step.

(b) Run combinational processes in topological order. Combinational assignments read

from current clock cycle / time step.
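Pre-processing step 1(b) can be sketched in VHDL as follows. The entity decompose_demo and the values assigned in the else branches are assumptions made for illustration; only the idea of one target signal per process comes from the technique above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity decompose_demo is
  port (
    a, b, c : in  std_logic;
    d, e    : out std_logic
  );
end decompose_demo;

architecture split of decompose_demo is
begin
  -- Original form: one process driving both d and e.
  --   process (a, b, c) begin
  --     if a = '1' then d <= b; e <= c;
  --     else            d <= c; e <= b;   -- assumed else branch
  --     end if;
  --   end process;

  -- Decomposed form: one target signal per process.
  proc_d : process (a, b, c)
  begin
    if a = '1' then d <= b; else d <= c; end if;
  end process;

  proc_e : process (a, b, c)
  begin
    if a = '1' then e <= c; else e <= b; end if;
  end process;
end split;
```

After decomposition each process can be placed independently in the topological order used in step 2(b).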

1.7.2 Examples of RTL Simulation

Combinational Process Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

proc(a,b,c)
  if a = '1' then
    d <= b;
    e <= c;
  else

d



8. Run the timed process until suspend at wait for 99 ns;, which takes us from 3ns to

102ns.

9. Run combinational processes in topological order to calculate values on c, d, e from 3ns to

102ns.

Question: Draw the RTL waveforms that correspond to the delta-cycle waveform

below.

[Figure: delta-cycle waveform with rows a, b, c, d, e, proc1, proc2, proc3, delta cycle, sim cycle, sim round; time axis runs from 0ns through 3ns and its delta cycles to 102ns]

Answer:

[Figure: RTL waveform with rows a, b, c, d, e; time axis 0ns, 1ns, 2ns, 3ns, 102ns]


Example: Communicating State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Note: It is easier to do a simulation by hand if you start your clock at 0 and use the first clock phase in the waveform diagram for the first values that your VHDL code assigns to signals.

Simulate If-Then-Else, Wait Until . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

huey: process

begin

clk



A Related Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Small changes to the code can cause significant changes to the behaviour.

riri: process

begin

clk



1.8.3.3 Flops with Chip-Enable

The two code fragments below synthesize to identical hardware (flops with chip-enable lines).

If

process (clk)

begin

if rising_edge(clk) then

if (ce = '1') then

q



(a) Flops use if statements

(b) Flops use wait statements

Some examples of these different options are shown in Figures 1.21–1.24.

[Schematic with labels S, R, sel, reset, clk, a, c]

entity and_not_reg is
  port (
    reset,
    clk,
    sel : in std_logic;
    c : out std_logic
  );
end;

Schematic and entity for examples of different code organizations in Figures 1.21–1.24

Figure 1.20: Schematic and entity for and_not_reg

One Process, Flops, Wait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

architecture one_proc of and_not_reg is

signal a : std_logic;

begin

process begin

wait until rising_edge(clk);

if (reset = '1') then

a



Two Processes with If-Then-Else . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

architecture two_proc_if of and_not_reg is

signal a : std_logic;

begin

process (clk)

begin


if (reset = '1') then

a



1.10.4 Different Widths and Arithmetic

Table 1.2: Different Vector Widths and Arithmetic Operations (+, -)

target   src1/2   src2/1   result
narrow   wide     -        fails in elaboration
wide     narrow   int      fails in elaboration
wide     wide     -        OK
narrow   narrow   narrow   OK
narrow   narrow   int      OK

Example vectors:
wide     unsigned(7 downto 0)
narrow   unsigned(4 downto 0)
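Two rows of the table can be sketched in code, together with one common way to repair a failing combination. The entity width_demo and its port names are hypothetical; resize is the function of that name from ieee.numeric_std, and using it here is a suggestion rather than something the notes prescribe.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity width_demo is
  port (
    narrow_in  : in  unsigned(4 downto 0);
    narrow_out : out unsigned(4 downto 0);
    wide_out   : out unsigned(7 downto 0)
  );
end width_demo;

architecture main of width_demo is
begin
  -- narrow target, narrow + int: the result is 5 bits wide, so this is OK.
  narrow_out <= narrow_in + 1;

  -- wide target, narrow + int: the sum would be 5 bits wide and fail.
  -- Widening the narrow operand first makes the sum 8 bits wide.
  wide_out <= resize(narrow_in, wide_out'length) + 1;
end main;
```

The width of the sum is set by the vector operand, not by the target, which is why the target width has to be matched explicitly.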

1.10.5 Overloading of Comparisons

Table 1.3: Overloading of Comparison Operations (=, /=, >=, >, <=, <)



1.11.1.4 Multiple if rising_edge Statements in Same Process

Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)

process (clk)

begin


q0



Synthesizable Alternative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge outside of the for-loop.

process (clk) begin
  if rising_edge(clk) then
    for i in 0 to 7 loop
      q(i)
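A complete version of this pattern might look like the following sketch. The entity reg8, the 8-bit ports d and q, and the loop body (a plain bit-wise copy) are assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity reg8 is
  port (
    clk : in  std_logic;
    d   : in  std_logic_vector(7 downto 0);
    q   : out std_logic_vector(7 downto 0)
  );
end reg8;

architecture main of reg8 is
begin
  process (clk) begin
    if rising_edge(clk) then      -- single clock-edge test, outside the loop
      for i in 0 to 7 loop
        q(i) <= d(i);             -- assumed loop body: plain bit-wise copy
      end loop;
    end if;
  end process;
end main;
```

With the edge test outside the loop, the process has exactly one if rising_edge statement, and the loop unrolls into eight flip-flops that share one clock.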



1.13 VHDL Problems

P1.1 IEEE 1164

For each of the values in the list below, answer whether or not it is defined in the ieee.std_logic_1164 library. If it is part of the library, write a 2–3 word description of the value.

Values: -, #, 0, 1, A, h, H, L, Q, X, Z.

P1.2 VHDL Syntax

Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.

NOTES: 1) ... represents a fragment of legal VHDL code.

2) For full marks, if the code is illegal, you must explain why.

3) The code has been written so that, if it is illegal, then it is illegal for both

simulation and synthesis.

q2a

architecture main of anchiceratops is

signal a, b, c : std_logic;

begin

process begin

wait until rising_edge(c);

a p, b => q);

...

end main;

q2e

architecture main of pachyderm is

function inv(a : std_logic)

return std_logic is

begin

return(NOT a);

end inv;

signal p, b : std_logic;

begin

p a);

...

end main;

q2f

architecture main of apatosaurus is

type state_ty is (S0, S1, S2);

signal st : state_ty;

signal p : std_logic;

begin

case st is

when S0 | S1 => p p



P1.3 Flops, Latches, and Combinational Circuitry

For each of the signals p...z in the architecture main of montevido, answer whether the signal is a latch, combinational gate, or flip-flop.

entity montevido is

port (

a, b0, b1, c0, c1, d0, d1, e0, e1 : in std_logic;

l : in std_logic_vector (1 downto 0);

p, q, r, s, t, u, v, w, x, y, z : out std_logic

);

end montevido;

architecture main of montevido is

signal i, j : std_logic;

begin

i



entity bigckt is

port (

a, b : in std_logic;

c : out std_logic

);

end bigckt;

architecture main of bigckt is

begin

process (a, b)

begin

if (a = '0') then

c



P1.6 Delta-Cycle Simulation: Pong

Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.

INSTRUCTIONS:

1. The simulation is to be done at the granularity of simulation-steps.

2. Show all changes to process modes and signal values.

3. Each column of the timing diagram corresponds to a simulation step that changes a signal or

process.

4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round.

5. End your simulation just before 20 ns.

architecture main of pong_machine is

signal ping_i, ping_n, pong_i, pong_n : std_logic;

begin

reset_proc: process

reset



P1.8 Clock-Cycle Simulation

Given the VHDL code for anapurna and waveform diagram below, answer what the values of the signals y, z, and p will be at the given times.

entity anapurna is

port (

clk, reset, sel : in std_logic;

a, b : in unsigned(15 downto 0);

p : out unsigned(15 downto 0)

);

end anapurna;

architecture main of anapurna is

type state_ty is (mango, guava, durian, papaya);

signal y, z : unsigned(15 downto 0);

signal state : state_ty;

begin

proc_herzog: process

begin

top_loop: loop

wait until (rising_edge(clk));

next top_loop when (reset = '1');

state



P1.10 VHDL–VHDL Behavioural Comparison: Ichthyostega

For each of the VHDL architectures q4a through q4c, does the signal v have the same behaviour as it does in the main architecture of ichthyostega?

NOTES: 1) For full marks, if the code has different behaviour, you must explain why.

2) Ignore any differences in behaviour in the first few clock cycles that are caused by initialization of flip-flops, latches, and registers.

3) All code fragments in this question are legal, synthesizable VHDL code.

entity ichthyostega is

port (

clk : in std_logic;

b, c : in signed(3 downto 0);

v : out signed(3 downto 0)

);

end ichthyostega;

architecture main of ichthyostega is

signal bx, cx : signed(3 downto 0);

begin

process begin

wait until (rising_edge(clk));

bx



P1.11 Waveform VHDL Behavioural Comparison

Answer whether each of the VHDL code fragments q3a through q3d has the same behaviour as the timing diagram.

NOTES: 1) Same behaviour means that the signals a, b, and c have the same values at

the end of each clock cycle in steady-state simulation (ignore any irregularities

in the first few clock cycles).

2) For full marks, if the code does not match, you must explain why.

3) Assume that all signals, constants, variables, types, etc are properly defined

and declared.

4) All of the code fragments are legal, synthesizable VHDL code.

[Timing diagram: signals clk, a, b, c]

q3a

architecture q3a of q3 is

begin

process begin

a



P1.12 Hardware VHDL Comparison

For each of the circuits q2a–q2d, answer whether the signal d has the same behaviour as it does in the main architecture of q2.

entity q2 is

port (

a, clk, reset : in std_logic;

d : out std_logic

);

end q2;

architecture main of q2 is

signal b, c : std_logic;

begin

  b <= '0' when (reset = '1')
       else a;

  process (clk) begin
    if rising_edge(clk) then
      c <= b;
      d <= c;

end if;

end process;

end main;

[Circuit diagrams q2a, q2b, q2c, q2d: each is drawn with the signals clk, a, reset, a constant 0, and the output d]

P1.13 8-Bit Register

Implement an 8-bit register that has:

clock signal clk

input data vector d

output data vector q

synchronous active-high input reset

synchronous active-high input enable

P1.13.1 Asynchronous Reset

Modify your design so that the reset signal is asynchronous, rather than synchronous.

P1.13.2 Discussion

Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on an FPGA.

P1.13.3 Testbench for Register

Write a test bench to validate the functionality of the 8-bit register with synchronous reset.


P1.14 Synthesizable VHDL and Hardware

For each of the fragments of VHDL q4a...q4f, answer whether the code is synthesizable. If the code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of the code. If the code is not synthesizable, explain why.

q4a

process begin

wait until rising_edge(a);

e <= d;

wait until rising_edge(b);

e



P1.15 Datapath Design

Each of the three VHDL fragments q4a–q4c is intended to be the datapath for the same circuit. The circuit is intended to perform the following sequence of operations (not all operations are required to use a clock cycle):

read in source and destination addresses from i_src1, i_src2, i_dst

read operands op1 and op2 from memory

compute sum of operands sum

write sum to memory at destination address dst

write sum to output o_result

[Figure: datapath block with ports i_src1, i_src2, i_dst, o_result, clk]

P1.15.1 Correct Implementation?

For each of the three fragments of VHDL q4a–q4c, answer whether it is a correct implementation of the datapath. If the datapath is not correct, explain why. If the datapath is correct, answer in which cycle you need load=1.

NOTES:

1. You may choose the number of clock cycles required to execute the sequence of operations.

2. The cycle in which the addresses are on i_src1, i_src2, and i_dst is cycle #0.

3. The control circuitry that controls the datapath will output a signal load, which will be 1 when the sum is to be written into memory.

4. The code fragment with the signal declarations, connections for inputs and outputs, and the instantiation of memory is to be used for all three code fragments q4a–q4c.

5. The memory has registered inputs and combinational (unregistered) outputs.

6. All of the VHDL is legal, synthesizable code.

-- This code is to be used for

-- all three code fragments q4a--q4c.

signal state : std_logic_vector(3 downto 0);

signal src1, src2, dst, op1, op2, sum,
       mem_in_a, mem_out_a, mem_out_b,
       mem_addr_a, mem_addr_b
       : unsigned(7 downto 0);

...

process (clk)

begin


src1 mem_we,

i_data_a => mem_in_a,

o_data_a => mem_out_a,

o_data_b => mem_out_b);


q4a

op1 0);op2 0);

sum 0);

mem_in_a 0);

mem_addr_a



Chapter 2

RTL Design with VHDL: From

Requirements to Optimized Code

2.1 Prelude to Chapter

2.1.1 A Note on EDA for FPGAs and ASICs

The following is from John Cooley's column "The Industry Gadfly" from 2003/04/30. The title of this article is "The FPGA EDA Slums".

For 2001, Dataquest reported that the ASIC market was US$16.6 billion while the FPGA market was US$2.6 billion.

What's more interesting is that the 2001 ASIC EDA market was US$2.2 billion while the FPGA EDA market was US$91.1 million. Nope, that's not a mistake. It's ASIC EDA and billion versus FPGA EDA and million. Do the math and you'll see that for every dollar spent on an ASIC project, roughly 12 cents of it goes to an EDA vendor. For every dollar spent on a FPGA project, roughly 3.4 cents goes to an EDA vendor. Not good.

"It's the old free milk and a cow story," according to Gary Smith, the Senior EDA Analyst at Dataquest. "Altera and Xilinx have fouled their own nest. Their free tools spoil the FPGA EDA market," says Gary. "EDA vendors know that there's no money to be made in FPGA tools."


2.2 FPGA Background and Coding Guidelines

2.2.1 Generic FPGA Hardware

2.2.1.1 Generic FPGA Cell

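Cell = Logic Element (LE) in Altera
     = Configurable Logic Block (CLB) in Xilinx

[Figure: generic FPGA cell, consisting of a combinational block (comb) and a flip-flop (D, Q, CE, S, R). Signals: comb_data_in, ctrl_in, carry_in, carry_out, comb_data_out, flop_data_in, flop_data_out.]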

2.2.2 Area Estimation

We estimate the number of FPGA cells required for a design by counting the number of flip-flops and primary inputs that are in the fanin of each flip-flop. Only flip-flops and primary inputs count as sources, because combinational signals are collapsed into the circuitry within an FPGA cell. The circuitry for any flip-flop signal with up to four source flip-flops can be implemented on a single FPGA cell. If a flip-flop signal is dependent upon five source flip-flops, then two FPGA cells are required.

Source flops/inputs    Minimum cells
         1                   1
         2                   1
         3                   1
         4                   1
         5                   2
         6                   2
         7                   2
         8                   3
         9                   3
        10                   3
        11                   4

For a single target signal, this technique gives a lower bound on the number of cells needed. For

example, some functions of seven inputs require more than two cells. As a particular example, a

four-to-one multiplexer has six inputs and requires three cells.

When dealing with multiple target signals, this technique might be an overestimate, because a

single cell can drive several other cells (common subexpression elimination).
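The table of minimum cells is consistent with a simple closed form. This is our reading of the table, not a formula stated in the notes; it assumes each generic cell implements one function of up to four inputs, so the first cell absorbs four sources and each additional cell contributes three new sources, because one of its inputs carries the output of another cell. With n the number of source flops/inputs:

```latex
\[
  \text{min\_cells}(n) \;=\; \max\!\left(1,\ \left\lceil \frac{n-1}{3} \right\rceil\right)
\]
% Spot checks against the table:
%   n = 4  -> max(1, ceil(3/3))  = 1
%   n = 5  -> max(1, ceil(4/3))  = 2
%   n = 8  -> max(1, ceil(7/3))  = 3
%   n = 11 -> max(1, ceil(10/3)) = 4
```

As with the table, this is only a lower bound for a single target signal.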

PLA and Flop for Different Functions . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: generic FPGA cell in which the PLA (comb) and the flop are used for different functions.]

PLA and Flop for Same Function . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: generic FPGA cell in which the PLA (comb) and the flop are used for the same function.]

Estimate Area for Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Question: Map the combinational circuits below onto generic FPGA cells.

[Figure: example circuits over inputs a to i and signals w, x, y, z, each shown with its mapping onto generic FPGA cells; the mappings use one, two, and three cells.]

2.2.2.1 Interconnect for Generic FPGA

Note: In these slides, the space between tightly grouped wires sometimes

disappears, making a group of wires appear to be a single large wire.

There are two types of wires that connect a cell to the rest of the chip:

General purpose interconnect (configurable, slow)

Carry chains and cascade chains (vertically adjacent cells, fast)

2.2.2.2 Blocks of Cells for Generic FPGA

Cells are organized into blocks. There is a great deal of interconnect (wires) between cells within

a single block. In large FPGAs, the blocks are organized into larger blocks. These large blocks

might themselves be organized into even larger blocks. Think of an FPGA as a bunch of nested

for-generate statements that replicate a single component (cell) hundreds of thousands of

times.
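The nested for-generate picture can be sketched in VHDL. This is a minimal illustration only: the entity names cell and fabric, the generics n_blocks and n_cells, and the trivial xor-plus-flop cell are all hypothetical stand-ins, and a real FPGA fabric also contains the interconnect described above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-in for one FPGA cell: a 2-input function plus a flip-flop.
entity cell is
  port (
    clk  : in  std_logic;
    a, b : in  std_logic;
    q    : out std_logic
  );
end cell;

architecture main of cell is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      q <= a xor b;
    end if;
  end process;
end main;

library ieee;
use ieee.std_logic_1164.all;

-- Two levels of hierarchy: blocks of cells, replicated by nested generates.
entity fabric is
  generic (
    n_blocks : positive := 4;   -- blocks per chip (illustrative value)
    n_cells  : positive := 10   -- cells per block (illustrative value)
  );
  port (
    clk  : in  std_logic;
    a, b : in  std_logic_vector(n_blocks*n_cells - 1 downto 0);
    q    : out std_logic_vector(n_blocks*n_cells - 1 downto 0)
  );
end fabric;

architecture main of fabric is
begin
  blocks : for i in 0 to n_blocks - 1 generate
    cells : for j in 0 to n_cells - 1 generate
      u_cell : entity work.cell
        port map (
          clk => clk,
          a   => a(i*n_cells + j),
          b   => b(i*n_cells + j),
          q   => q(i*n_cells + j)
        );
    end generate cells;
  end generate blocks;
end main;
```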

Cells not used for computation can be used as wires to shorten the length of the path between cells.



2.2.4 Altera APEX20K Information and Coding Guidelines

APEX20K Block Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Chip
  52 Mega Logic Array Blocks (MegaLABs)
    1 Embedded System Block (ESB)
      Memory and wide combinational functions
    16 Logic Array Blocks (LABs)
      10 Logic Elements (LEs)
        4-input lookup table
        Carry and cascade
        Flip-flop

Each level of hierarchy has its own interconnect (wires).

LE Computation and Storage . . . . . . . . .

4-input lookup table (LUT)

Carry-chain computation circuitry

Cascade-chain computation circuitry

Flip-flop with load, clear, clock-enable

LE Interconnect . . . . . . . . . . . . . . . . . . . . . .

4 data inputs, 2 data outputs

Carry in, carry out

Cascade in, cascade out

Clock, clock-enable

Async clear, synch set (load), synch clear (reset)

Global reset

Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The Altera APEX20K chips initialize all flip flops to 0 at startup. To mimic this behaviour in

simulation, you should put an initial value of 0 on all flip flops. If you are doing your own

encoding for a state machine, choose the reset state to be encoded as all zeroes.

You should not put initial values on inputs or combinational signals.
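The initialization guidelines can be illustrated with a small sketch. The entity init_demo, its ports go and busy, and the three-state encoding are hypothetical; the point is only that the flop-backed state signal gets an initial value of all zeroes, the reset state idle is encoded as "00", and the combinational output busy gets no initial value.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity init_demo is
  port (
    clk, go : in  std_logic;   -- inputs: no initial values
    busy    : out std_logic
  );
end init_demo;

architecture main of init_demo is
  -- Hand-written state encoding: the reset state is all zeroes.
  subtype state_ty is std_logic_vector(1 downto 0);
  constant idle : state_ty := "00";
  constant run  : state_ty := "01";
  constant done : state_ty := "10";
  -- State is stored in flip-flops, so it gets an initial value of 0
  -- to mimic the power-up behaviour of the chip in simulation.
  signal state : state_ty := idle;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when idle =>
          if go = '1' then
            state <= run;
          end if;
        when run =>
          state <= done;
        when others =>
          state <= idle;
      end case;
    end if;
  end process;

  -- Combinational signal: no initial value.
  busy <= '0' when state = idle else '1';
end main;
```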

2.3 Design Flow

2.3.1 Generic Design Flow

Most people agree on the general terminology and process for a digital hardware design flow. However, each book and course has its own particular way of presenting the ideas. Here we will lay out the consistent set of definitions that we will use in E&CE 427. This might be different from what you have seen in other courses or on a work term. Focus on the ideas and you will be fine both now and in the future.

The design flow presented here focuses on the artifacts that we work with, rather than the operations that are performed on the artifacts. This is because the same operations can be performed at different points in the design flow, while the artifacts each have a unique purpose.

[Figure: artifacts of the design flow, each refined through Analyze/Modify steps: Requirements, Algorithm, High-Level Model, DP+Ctrl Code, Opt. RTL Code, Implementation, Hardware. One portion of the flow is marked as dp/ctrl specific.]

Figure 2.1: Generic Design Flow


Storage

Purpose: hold data for future use
Data is not modified while stored
Examples: register files, FIFO queues

Control

Purpose: modify internal state based on inputs, compute outputs from state and inputs
Mostly individual signals, few data (vectors)
Examples: bus arbiters, memory-controllers

All three classes of circuits (datapath, control, and storage) follow the same generic design flow (Figure 2.1) and use dataflow diagrams, hardware block diagrams, and state machines. The differences in the design flows appear in the relative amount of effort spent on each type of description and the order in which the different descriptions are used. The differences are most pronounced in the transition from the high-level model to the model that separates the datapath and control circuitry.

2.3.3.2 Datapath-Centric Design Flow

[Figure: artifacts shown: High-Level Model, Dataflow, Block Diagram, State Machine, DP+Ctrl RTL Code, connected by Analyze/Modify steps.]

Figure 2.2: Datapath-Centric Design Flow

2.3.3.3 Control-Centric Design Flow

[Figure: artifacts shown: High-Level Model, State Machine, Dataflow Diagram, Block Diagram, DP+Ctrl RTL Code, connected by Analyze/Modify steps.]

Figure 2.3: Control-Centric Design Flow

2.3.3.4 Storage-Centric Design Flow

In E&CE 427, we won't be discussing storage-centric design. Storage-centric design differs from datapath- and control-centric design in that storage-centric design focusses on building many replicated copies of small cells.

Storage-centric designs include a wide range of circuits, from simple memory arrays to complicated circuits such as register files, translation lookaside buffers, and caches. The complicated circuits can contain large and very intricate state machines, which would benefit from some of the techniques for control-centric circuits.

2.4 Algorithms and High-Level Models

For designs with significant control flow, algorithms can be described in software languages, flowcharts, abstract state machines, algorithmic state machines, etc.

For designs with trivial control flow (e.g. every parcel of input data undergoes the same computation), data-dependency graphs (section 2.4.2) are a good way to describe the algorithm.

For designs with a small amount of control flow (e.g. a microprocessor, where a single decision is made based upon the opcode) a set of data-dependency graphs is often a good choice.


Software executes in series; hardware executes in parallel.

When creating an algorithmic description of your hardware design, think about how you can represent parallelism in the algorithmic notation that you are using, and how you can exploit parallelism to improve the performance of your design.

2.4.1 Flow Charts and State Machines

Flow charts and various flavours of state machines are covered well in many courses. Generally, everything that you've learned about these forms of description is also applicable in hardware design.

In addition, you can exploit parallelism in state machine design to create communicating finite state machines. A single complex state machine can be factored into multiple simple state machines that operate in parallel and communicate with each other.

2.4.2 Data-Dependency Graphs

In software, the expression (((((a + b) + c) + d) + e) + f) takes the same amount of time to execute as (a + b) + (c + d) + (e + f).

But, remember: hardware runs in parallel. In algorithmic descriptions, parentheses can guide parallel vs serial execution.

Data-dependency graphs capture algorithms of datapath-centric designs.

Datapath-centric designs have few, if any, control decisions: every parcel of input data undergoes

the same computation.
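As a minimal sketch of how parentheses can guide the structure, the serial and parallel forms of the six-input sum can be written as two concurrent assignments. The entity name sum6, the 8-bit width, and the output names are assumptions for illustration, and a synthesis tool is free to restructure either expression.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sum6 is
  port (
    a, b, c, d, e, f : in  unsigned(7 downto 0);
    z_serial         : out unsigned(7 downto 0);
    z_parallel       : out unsigned(7 downto 0)
  );
end sum6;

architecture main of sum6 is
begin
  -- Chain of five adders: the longest path passes through all five.
  z_serial   <= ((((a + b) + c) + d) + e) + f;

  -- Tree of five adders: the longest path passes through three.
  z_parallel <= ((a + b) + (c + d)) + (e + f);
end main;
```

Both outputs compute the same sum with the same number of adders; only the depth of the adder structure differs.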

                 Serial                     Parallel
Expression       (((((a+b)+c)+d)+e)+f)      (a+b)+(c+d)+(e+f)
Longest path     5 adders (slower)          3 adders (faster)
Adders used      5 (equal area)             5 (equal area)

[Figure: data-dependency graphs over inputs a to f for the serial and parallel expressions.]

2.4.3 High-Level Models

There are many different types of high-level models, depending upon the purpose of the model and the characteristics of the design that the model describes. Some models may capture power consumption, others performance, others data functionality.

High-level models are used to estimate the most important design metrics very early in the design cycle. If power consumption is more important than performance, then you might write high-level models that can predict the power consumption of different design choices, but which have no information about the number of clock cycles that a computation takes, or which predict the latency inaccurately. Conversely, if performance is important, you might write clock-cycle accurate high-level models that do not contain any information about power consumption.

Conventionally, performance has been the primary design metric. Hence, high-level models that predict performance are more prevalent and better understood than other types of high-level models. There are many research and entrepreneurial opportunities for people who can develop tools and/or languages for high-level models for estimating power, area, maximum clock speed, etc.

In E&CE 427 we will limit ourselves to the well-understood area of high-level models for performance prediction.


As with all topics in E&CE 427, there are tradeoffs between these different styles of writing state machines. Most books teach only the explicit-current+next style. This style is the closest to the hardware, which means that it is more amenable to optimization through human intervention, rather than relying on a synthesis tool for optimization. The advantage of the implicit style is that it is concise and readable for control flows consisting of nested loops and branches (e.g. the type of control flow that appears in software). For control flows that have less structure, it can be difficult to write an implicit state machine. Very few books or synthesis manuals describe multiple-wait statement processes, but they are relatively well supported among synthesis tools.

Because implicit state machines are written with loops, if-then-elses, cases, etc. it is difficult to

write some state machines with complicated control flows in an implicit style. The following

example illustrates the point.

[Figure: state machine with states s0/0, s1/1, s2/0, s3/0 and transitions labelled a, !a, !a, a.]

Note: The terminology of "explicit" and "implicit" is somewhat standard, in that some descriptions of processes with multiple wait statements describe the processes as having "implicit state machines".

There is no standard terminology to distinguish between the two explicit styles: explicit-current+next and explicit-current.

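2.5.2 Implementing a Simple Moore Machine

[Figure: Moore state machine with states s0/0, s1/1, s2/0, s3/0 and transitions labelled a and !a.]

entity simple is
  port (
    a, clk : in std_logic;
    z : out std_logic
  );
end simple;

2.5.2.1 Implicit Moore State Machine

architecture moore_implicit of simple is
begin
  process
  begin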

z
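The implicit listing is cut off at this point in this copy. The following is a hedged sketch of what an implicit Moore machine for this entity could look like, not the original listing. It assumes the state diagram means: from s0, go to s1 (z = 1) when a is 1 and to s2 (z = 0) otherwise; s1 and s2 both go to s3; s3 returns to s0. The architecture name moore_implicit_sketch is ours.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity simple is
  port (
    a, clk : in  std_logic;
    z      : out std_logic
  );
end simple;

architecture moore_implicit_sketch of simple is
begin
  -- No state signal: the state is implied by which wait statement
  -- the process is suspended at.
  process
  begin
    z <= '0';                       -- s0
    wait until rising_edge(clk);
    if a = '1' then
      z <= '1';                     -- s1
    else
      z <= '0';                     -- s2
    end if;
    wait until rising_edge(clk);
    z <= '0';                       -- s3
    wait until rising_edge(clk);
  end process;
end moore_implicit_sketch;
```

Every assignment to z happens immediately after a clock edge, so z behaves as a flip-flop output, which is what makes this a Moore machine.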


architecture moore_explicit_v1 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    case state is
      when s0 =>
        if (a = '1') then
          state

architecture moore_explicit_v3 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk)
  begin
    state

Mealy machines have a combinational path from inputs to outputs, which often violates good coding guidelines for hardware. Thus, Moore machines are much more common. You should know how to write a Mealy machine if needed, but most of the state machines that you design will be Moore machines.

This is the same entity as for the simple Moore state machine. The behaviour of the Mealy machine is the same as the Moore machine, except for the timing relationship between the output (z) and the input (a).

[Figure: Mealy state machine with states s0, s1, s2, s3 and transition labels a/1, !a/0, /0, /0.]

entity simple is
  port (
    a, clk : in std_logic;
    z : out std_logic
  );
end simple;

Note: An implicit Mealy state machine is nonsensical.

In an implicit state machine, we do not have a state signal. But, as the example below illustrates,

to create a Mealy state machine we must have a state signal.

An implicit style is a nonsensical choice for Mealy state machines. Because the output is depen-

dent upon the input in the current clock cycle, the output cannot be a flop. For the output to be

combinational and dependent upon both the current state and the current input, we must create a

state signal that we can read in the assignment to the output. Creating a state signal obviates the

advantages of using an implicit style of state machine.

architecture implicit_mealy of simple is

type state_ty is (s0, s1, s2, s3);

signal state : state_ty;

begin

process

begin

state


architecture mealy_explicit of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    case state is
      when s0 =>
        if (a = '1') then

state
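The explicit Mealy listing is truncated in this copy. As a hedged sketch rather than the original code, an explicit Mealy machine for this entity might look as follows. It assumes the same transitions as the Moore example (s0 to s1 on a, s0 to s2 on !a, then s3, then back to s0) and that z is 1 only while the machine is in s0 and a is 1; the architecture name mealy_sketch is ours.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity simple is
  port (
    a, clk : in  std_logic;
    z      : out std_logic
  );
end simple;

architecture mealy_sketch of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty := s0;
begin
  -- State register and next-state logic.
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if a = '1' then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when s3 =>
          state <= s0;
      end case;
    end if;
  end process;

  -- Mealy output: combinational, depends on the current state
  -- and on the input in the current clock cycle.
  z <= '1' when (state = s0 and a = '1') else '0';
end mealy_sketch;
```

The output assignment has to read a state signal, which is why the notes argue that an implicit style does not fit Mealy machines.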


All circuits should have a reset signal that puts the circuit back into a good initial state. However, not all flip flops within the circuit need to be reset. In a circuit that has a datapath and a state machine, the state machine will probably need to be reset, but the datapath may not need to be reset.

There are standard ways to add a reset signal to both explicit and implicit state machines.

It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or

your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.

Reset with Implicit State Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

With an implicit state machine, we need to insert a loop in the process and test for reset after each

wait statement.

Here is the implicit Moore machine from section 2.5.2.1 with reset code added in bold.

architecture moore_implicit of simple is

begin

process

begin

init : loop -- outermost loop

z
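The listing with reset is cut off in this copy. A hedged sketch of the technique described above (an outermost loop, with reset tested after each wait statement) follows. The entity simple_rst, its synchronous reset port, and the architecture name are assumptions; the state sequence is the one assumed for the simple Moore machine.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical variant of the entity "simple" with a synchronous reset input.
entity simple_rst is
  port (
    a, clk, reset : in  std_logic;
    z             : out std_logic
  );
end simple_rst;

architecture moore_implicit_sketch of simple_rst is
begin
  process
  begin
    init : loop                       -- outermost loop
      z <= '0';                       -- s0 (also the reset state)
      wait until rising_edge(clk);
      next init when reset = '1';     -- test reset after every wait
      if a = '1' then
        z <= '1';                     -- s1
      else
        z <= '0';                     -- s2
      end if;
      wait until rising_edge(clk);
      next init when reset = '1';
      z <= '0';                       -- s3
      wait until rising_edge(clk);
      next init when reset = '1';
    end loop init;
  end process;
end moore_implicit_sketch;
```

Because every wait is followed by a reset test, reset is examined on every clock cycle, as the guideline above requires.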




Tradeoffs in Encoding Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Gray is good for low-power applications where consecutive data values typically differ by 1 (e.g. no random jumps).

One-hot usually has less combinational logic and runs faster than binary for machines with up to a dozen or so states. With more than a dozen states, the extra flip-flops required by one-hot encoding become too expensive.

Custom is great if you have lots of time and are incredibly intelligent, or have deep insight into the guts of your design.

Note: Don't-care values. When we don't care what the value of a signal is, we assign the signal '-', which is "don't care" in VHDL. This should allow the synthesis tool to use whatever value is most helpful in simplifying the Boolean equations for the signal (e.g. Karnaugh maps). In the past, some groups in E&CE 427 have used '-' quite successfully to decrease the area of their design. However, a few groups found that using '-' increased the size of their design, when they were expecting it to decrease the size. So, if you are tweaking your design to squeeze out the last few unneeded FPGA cells, pay close attention as to whether using '-' hurts or helps.
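A minimal sketch of assigning don't-care values in VHDL follows. The entity dc_demo and its truth table are hypothetical, and the sketch assumes the input value "11" never occurs in the design, so the output for that case is left to the synthesis tool.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity dc_demo is
  port (
    sel : in  std_logic_vector(1 downto 0);
    y   : out std_logic_vector(1 downto 0)
  );
end dc_demo;

architecture main of dc_demo is
begin
  process (sel)
  begin
    case sel is
      when "00"   => y <= "01";
      when "01"   => y <= "10";
      when "10"   => y <= "11";
      -- Assumed never to occur: let the tool pick whatever
      -- value simplifies the logic for y.
      when others => y <= "--";
    end case;
  end process;
end main;
```

As the note above warns, whether this actually reduces area depends on the tool, so it is worth comparing against a version that assigns a fixed value.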

2.6 Dataflow Diagrams

2.6.1 Dataflow Diagrams Overview

Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.

Purpose:

Provide a disciplined approach for designing datapath-centric circuits

Guide the design from algorithm, through high-level models, and finally to register transfer

level code for the datapath and control circuitry.

Estimate area and performance

Make tradeoffs between different design options

Background:

Based on techniques from high-level synthesis tools

Some similarity between high-level synthesis and software compilation

Each dataflow diagram corresponds to a basic block in software compiler terminology.

[Figure: data-dependency graph for z = a + b + c + d + e + f, with intermediate signals x1 to x4.]

Data-dependency graph for z = a + b + c + d + e + f

[Figure: the same graph divided into clock cycles. Horizontal lines mark clock cycle boundaries.]

Dataflow diagram for z = a + b + c + d + e + f

2.6.2 Dataflow Diagrams, Hardware, and Behaviour

The use of memory arrays in dataflow diagrams is described in section 2.7.4.

Primary Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: dataflow diagram, hardware, and behaviour (waveforms for clk, i, x) for a primary input i driving signal x.]

Register Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: dataflow diagram, hardware, and behaviour (waveforms for clk, i, x) for a registered input i driving signal x.]

Register Signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: dataflow diagram, hardware, and behaviour (waveforms for clk, i1, i2, x) for a registered signal x computed from i1 and i2 by an adder.]

Combinational-Component Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: dataflow diagram, hardware, and behaviour (waveforms for clk, i1, i2, x) for a combinational signal x computed from i1 and i2 by an adder.]
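The difference between a register signal and a combinational-component output can be sketched in VHDL. This is our reading of the two cases: a signal that crosses a clock-cycle boundary in the dataflow diagram becomes a flip-flop, while a signal consumed within the same cycle stays combinational. The entity reg_vs_comb, the 8-bit width, and the output names x_reg and x_comb are assumptions for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity reg_vs_comb is
  port (
    clk    : in  std_logic;
    i1, i2 : in  unsigned(7 downto 0);
    x_reg  : out unsigned(7 downto 0);
    x_comb : out unsigned(7 downto 0)
  );
end reg_vs_comb;

architecture main of reg_vs_comb is
begin
  -- Combinational-component output: the sum is visible in the same
  -- clock cycle as i1 and i2.
  x_comb <= i1 + i2;

  -- Register signal: the sum crosses a clock boundary, so it is
  -- visible one clock cycle after i1 and i2.
  process (clk)
  begin
    if rising_edge(clk) then
      x_reg <= i1 + i2;
    end if;
  end process;
end main;
```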


2.6.3 Area Estimation

The maximum number of blocks of a component in any one clock cycle is the total number of that component needed.
The maximum number of signals that cross a cycle boundary is the total number of registers needed.
The maximum number of unconnected signal tails in a clock cycle is the total number of inputs needed.
The maximum number of unconnected signal heads in a clock cycle is the total number of outputs needed.

The information above is only for estimating the number of components that are needed. In fact, these estimates give lower bounds. There might be constraints on your design that will force you to use more components (e.g., you might need to read all of your inputs at the same time).

Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath components, might force you to make tradeoffs that increase the number of datapath components to decrease the overall area of the circuit.

Of particular relevance to FPGAs:

With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell per bit.
In FPGAs, registers are usually free, in that the area consumed by a circuit is limited by the amount of combinational logic, not the number of flip-flops.

In comparison, with ASICs and custom VLSI, 2:1 multiplexers are much smaller than adders, and registers are quite expensive in area.

2.6.4 Dataflow Diagram Execution

Execution with Registers on Both Inputs and Outputs . . . . . . . . . . . . . . . . . .

[Figure: dataflow diagram for z = a + b + c + d + e + f with clock cycle boundaries 0 to 6, and waveforms for clk, a, x1, x2, x3, x4, x5, z over clock cycles 0 to 6.]

Execution Without Output Registers . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: dataflow diagram for z = a + b + c + d + e + f with clock cycle boundaries 0 to 5, and waveforms for clk, a, x1, x2, x3, x4, x5, z over clock cycles 0 to 6.]


2.6.5 Performance Estimation

Performance Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

\[ \text{Performance} \propto \frac{1}{\text{TimeExec}} \]

\[ \text{TimeExec} = \text{Latency} \times \text{ClockPeriod} \]

Latency = Number of clock cycles from inputs to outputs

There is much more information on performance in chapter 4, which is devoted to performance.

Performance of Dataflow Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Latency: count horizontal lines in diagram
Min clock period (Max clock speed) limited by longest path in a clock cycle

2.6.6 Design Analysis

[Figure: dataflow diagram for z = a + b + c + d + e + f with one add per clock cycle and intermediate signals x1 to x4.]

num inputs          6
num outputs         1
num registers       6
num adders          1
min clock period    delay through flop and one adder
latency             6 clock cycles

2.6.7 Area / Performance Tradeoffs

[Figure: dataflow diagrams for z = a + b + c + d + e + f with one add per clock cycle (cycle boundaries 0 to 6) and two adds per clock cycle (cycle boundaries 0 to 4).]

Note: In the Two-add design, half of the last clock cycle is wasted.

Two Adds per Clock Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

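[Figure: dataflow diagram for z = a + b + c + d + e + f with two adds per clock cycle (cycle boundaries 0 to 4), and waveforms for clk, a, x1, x2, x3, x4, x5, z over clock cycles 0 to 6.]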


Design Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: the one-add-per-clock-cycle and two-adds-per-clock-cycle dataflow diagrams side by side.]

                One add per clock cycle    Two adds per clock cycle
inputs          6                          6
outputs         1                          1
registers       6                          6
adders          1                          2
clock period    flop + 1 add               flop + 2 add
latency         6                          4

Question: Under what circumstances would each design option be fastest?

Answer:

time = latency × clock period

Compare execution times for both options, where T_f is the delay through a flop and T_a is the delay through an adder:

\[ T_1 = 6 \times (T_f + T_a) \qquad T_2 = 4 \times (T_f + 2 T_a) \]

One-add will be faster when T_1 < T_2:

\[
\begin{aligned}
6 (T_f + T_a) &< 4 (T_f + 2 T_a) \\
6 T_f + 6 T_a &< 4 T_f + 8 T_a \\
2 T_f &< 2 T_a \\
T_f &< T_a
\end{aligned}
\]

Sanity check: If add is slower than flop, then we want to minimize the number of adds. One-add has fewer adds, so one-add will be faster when add is slower than flop.
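A quick numeric check of the result, using made-up delays chosen only for illustration (T_f is the flop delay and T_a is the adder delay; the values are not taken from any real device):

```latex
\[
\begin{aligned}
T_f = 1\,\text{ns},\ T_a = 3\,\text{ns}: &\quad
  T_1 = 6\,(1 + 3) = 24\,\text{ns}, \quad
  T_2 = 4\,(1 + 2 \cdot 3) = 28\,\text{ns} \\
T_f = 3\,\text{ns},\ T_a = 1\,\text{ns}: &\quad
  T_1 = 6\,(3 + 1) = 24\,\text{ns}, \quad
  T_2 = 4\,(3 + 2 \cdot 1) = 20\,\text{ns}
\end{aligned}
\]
```

In the first case the adder is slower than the flop and the one-add design wins; in the second case the flop is slower and the two-add design wins, in agreement with the condition T_f < T_a.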

2.7 Memory Arrays and RTL Design

2.7.1 Memory Operations

Read of Memory with Registered Inputs . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: dataflow diagram (mem(rd) with inputs M and a, output d), hardware (memory M with ports WE, A, DI, DO and clock clk), and behaviour (waveforms for clk, a, we, do) for a memory read.]

Write to Memory with Registered Inputs . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: dataflow diagram (mem(wr) with inputs M, a, di, output M), hardware (memory M with ports WE, A, DI, DO and clock clk), and behaviour (waveforms for clk, a, we, di, do) for a memory write.]

Dual-Port Memory with Registered Inputs . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: dataflow diagram (mem(wr) with a0, di0 and mem(rd) with a1, do1), hardware (memory M with ports WE, A0, DI0, DO0, A1, DO1 and clock clk), and behaviour (waveforms for clk, a0, we, di0, do0, a1, do1) for a dual-port memory.]


Sequence of Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: dataflow diagram with a sequence of mem(wr) and mem(rd) operations on ports 0 and 1 (signals a0, di0, do0, a1, do1), and the corresponding waveforms.]

2.7.2 Memory Arrays in VHDL

2.7.2.1 Using a Two-Dimensional Array for Memory

A memory array can be written in VHDL as a two-dimensional array:

subtype data is std_logic_vector(7 downto 0);
type data_vector is array (natural range <>) of data;
signal mem : data_vector(31 downto 0);

These two-dimensional arrays can be useful in high-level models and in specifications. However, it is possible to write code using a two-dimensional array that cannot be synthesized. Also, some synthesis tools (including Synopsys Design Compiler and FPGA Compiler) will synthesize two-dimensional arrays very inefficiently.

The example below illustrates: lack of interface protocol, combinational write, multiple write ports, multiple read ports.

architecture main of mem_not_hw is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array (natural range <>) of data;
  signal mem : data_vector(31 downto 0);
begin
  y


  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array (natural range <>) of data;
end;

entity mem is
  port (
    clk : in std_logic;
    we : in std_logic; -- write enable
    a : in unsigned(4 downto 0); -- address
    di : in data; -- data_in
    do : out data -- data_out
  );
end mem;

architecture main of mem is
  signal mem : data_vector(31 downto 0);
begin

do
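The architecture above is cut off in this copy. As a hedged sketch, not the original listing, a single-port memory with a synchronous write and a registered read address could be written as follows. The entity name mem_sketch, the port names, and the internal signal a_reg are assumptions, and whether a particular synthesis tool maps this onto a memory block depends on the tool.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mem_sketch is
  port (
    clk      : in  std_logic;
    we       : in  std_logic;                     -- write enable
    addr     : in  unsigned(4 downto 0);          -- address
    data_in  : in  std_logic_vector(7 downto 0);
    data_out : out std_logic_vector(7 downto 0)
  );
end mem_sketch;

architecture main of mem_sketch is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array (natural range <>) of data;
  signal mem   : data_vector(31 downto 0);
  signal a_reg : unsigned(4 downto 0);            -- registered read address
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Inputs are sampled on the clock edge.
      a_reg <= addr;
      if we = '1' then
        mem(to_integer(addr)) <= data_in;
      end if;
    end if;
  end process;

  -- Output is combinational from the registered address, so read data
  -- appears in the clock cycle after the address is presented.
  data_out <= mem(to_integer(a_reg));
end main;
```

With one clocked process, one write port, and one read port, this avoids the problems listed for the mem_not_hw example (combinational write, multiple write ports, multiple read ports).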



needs, you can construct your own component from smaller ones.

[Figure: two N×W memories sharing WriteEn, Addr, and Clk. One takes DataIn[W-1..0] and drives DataOut[W-1..0]; the other takes the upper W bits of DataIn and drives DataOut[2W-1..W].]

Figure 2.4: An N×2W memory from N×W components

[Figure: two N×W memories sharing WriteEn, Addr[logN-1..0], DataIn, and Clk. Addr[logN] selects between the two memories for DataOut.]

Figure 2.5: A 2N×W memory from N×W components

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ram16x4s is
  port (
    clk, we  : in std_logic;
    data_in  : in std_logic_vector(3 downto 0);
    addr     : in unsigned(3 downto 0);
    data_out : out std_logic_vector(3 downto 0)
  );
end ram16x4s;

architecture main of ram16x4s is
  component ram16x1s
    port (
      d              : in std_logic;  -- data in
      a3, a2, a1, a0 : in std_logic;  -- address
      we             : in std_logic;  -- write enable
      wclk           : in std_logic;  -- write clock
      o              : out std_logic  -- data out
    );
  end component;
begin
  -- One 16x1 memory per data bit; all four share address, we, and clock.
  mem_gen:
  for i in 0 to 3 generate
    ram : ram16x1s
      port map (
        we   => we,
        wclk => clk,
        a3 => addr(3), a2 => addr(2),
        a1 => addr(1), a0 => addr(0),
        ----------------------------------------------
        -- d and o are dependent on i
        d => data_in(i),
        o => data_out(i)
        ----------------------------------------------
      );
  end generate;
end main;


2.7.2.6 Dual-Ported Memory

Dual-ported memory is similar to single-ported memory, except that it allows two simultaneous reads, or a simultaneous read and write.

When doing a simultaneous read and write to the same address, the read will usually not see the

data currently being written.

Question: Why do dual-ported memories usually not support writes on both ports?

Answer:

What should your memory do if you write different values to the same

address in the same clock cycle?

2.7.3 Data Dependencies

Definition of Three Types of Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

There are three types of data dependencies. The names come from pipeline terminology in computer architecture.

Read after Write       Write after Write      Write after Read
(True dependency)      (Load dependency)      (Anti dependency)

M[i] := ...            M[i] := ...            ... := M[i]
... := M[i]            M[i] := ...            M[i] := ...

Instructions in a program can be reordered, so long as the data dependencies are preserved.

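Purpose of Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: a producer (W1: R3 := ...) and a consumer (R1: ... := ... R3 ...), together with other writes to R3 (W0 and W2).
WAW ordering prevents W0 from happening after W1.
RAW ordering prevents R1 from happening before W1.
WAR ordering prevents W2 from happening before R1.]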

Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose

in ensuring that producer-consumer relationships are preserved.

Ordering of Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

M[2] := 21
M[3] := 31
A    := M[2]
B    := M[0]
M[3] := 32
M[0] := 01
C    := M[3]

M[3]  M[2]  M[1]  M[0]
 30    20    10     0

[Figure: dependency arrows between the statements above.]

Initial Program with Dependencies

M[2] := 21
M[3] := 31
A    := M[2]
B    := M[0]
M[3] := 32
M[0] := 01
C    := M[3]

Valid Modification

M[2] := 21
M[3] := 31
A    := M[2]
B    := M[0]
M[3] := 32
M[0] := 01
C    := M[3]

Valid (or Bad?) Modification

Answer:

Bad modification: M[3] := 32 must happen before C := M[3].
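The example program can be run as a small simulation-only VHDL sketch to see why the ordering matters. The entity dep_demo and the variable names are hypothetical, the initial contents M[3..0] = 30, 20, 10, 0 are our reading of the figure, and the value 01 is modelled as the integer 1.

```vhdl
entity dep_demo is
end dep_demo;

architecture sim of dep_demo is
begin
  process
    type mem_ty is array (0 to 3) of integer;
    -- Assumed initial memory contents.
    variable m : mem_ty := (0 => 0, 1 => 10, 2 => 20, 3 => 30);
    variable a, b, c : integer;
  begin
    m(2) := 21;
    m(3) := 31;
    a    := m(2);   -- RAW: must follow m(2) := 21
    b    := m(0);
    m(3) := 32;     -- WAW: must follow m(3) := 31
    m(0) := 1;      -- WAR: must follow b := m(0)
    c    := m(3);   -- RAW: must follow m(3) := 32

    assert a = 21 and b = 0 and c = 32
      report "unexpected result" severity failure;

    -- Moving "c := m(3)" above "m(3) := 32" would give c = 31,
    -- which is the bad modification described above.
    wait;
  end process;
end sim;
```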


2.7.4 Memory Arrays and Dataflow Diagrams

Legend for Dataflow Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: legend of dataflow diagram symbols, each labelled with a name: input port, output port, state signal, array read (name(rd)), array write (name(wr)).]

Basic Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: Memory Read, data := mem[addr]; mem(rd) takes mem and addr and produces data and mem (anti-dependency).]

[Figure: Memory Write, mem[addr] := data; mem(wr) takes mem, data, and addr and produces mem.]

Dataflow diagrams show the dependencies between operations. The basic memory operations are similar, in that each arrow represents a data dependency.

There are a few aspects of the basic memory operations that are potentially surprising:

The anti-dependency arrow producing mem on a read.
Reads and writes are dependent upon the entire previous value of the memory array.
The write operation appears to produce an entire memory array, rather than just updating an individual element of an existing array.

Normally, we think of a memory array as stationary. To do a read, an address is given to the array and the corresponding data is produced. In dataflow diagrams, it may be somewhat surprising to see the read and write operations consuming and producing memory arrays.

Our goal is to support memory operations in dataflow diagrams. We want to model memory operations similarly to datapath operations. When we do a read, the data that is produced is dependent upon the contents of the memory array and the address. For write operations, the apparent dependency on, and production of, an entire memory array is because we do not know which address in the array will be read from or written to. The anti-dependency for memory reads is related to Write-after-Read dependencies, as discussed in Section 2.7.3. There are optimizations that can be performed when we know the address (Section 2.7.4).

Algo: mem[wr addr] := data in;data out := mem[rd addr];

data_out

mem(wr)

data_in wr_addr

rd_addr

mem

mem(rd)

mem

Read after Write
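The whole-array view of reads and writes, and the read-after-write case in particular, can be mimicked with two small Python functions. This is only a sketch under assumptions: `mem_write` and `mem_read` are hypothetical names, the array contents are illustrative values, and a Python list stands in for the memory array. Each operation takes the complete array as an input and hands an array on to the next operation, which is how the ordering between operations is expressed.

```python
def mem_write(mem, addr, data):
    """Memory write: consumes the whole array, produces a new whole array."""
    new_mem = list(mem)
    new_mem[addr] = data
    return new_mem


def mem_read(mem, addr):
    """Memory read: consumes the whole array and an address.

    Returns the data plus the unchanged array. The returned array plays
    the role of the anti-dependency arrow: a later write that takes it as
    input is ordered after this read.
    """
    return mem[addr], mem


mem0 = [0, 10, 20, 30]  # illustrative contents of M[0]..M[3]

# Read after write: the read consumes the array produced by the write.
mem1 = mem_write(mem0, 2, 21)
a, mem2 = mem_read(mem1, 2)

# Write after read: the write consumes the array passed on by the read.
mem3 = mem_write(mem2, 2, 99)

print(a, mem0, mem3)
```

Because every operation depends on the whole array, this model is conservative: it orders operations even when they touch different addresses.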

Algo: mem[wr_addr] := data_in; data_out := mem[rd_addr];

[Dataflow diagram: mem(wr) with inputs mem, data_in, and wr_addr and output mem; mem(rd) with inputs mem and rd_addr and output data_out]

Optimization when rd_addr ≠ wr_addr

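Algo: mem[wr1_addr] := data1; mem[wr2_addr] := data2;

[Dataflow diagram: the first mem(wr) consumes mem, data1, and wr1_addr and produces mem; the second mem(wr) consumes that mem, data2, and wr2_addr and produces mem]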

Write after Write


Algo: mem[wr1_addr] := data1; mem[wr2_addr] := data2;

[Dataflow diagram: two mem(wr) operations, one with inputs mem, data1, and wr1_addr and one with inputs mem, data2, and wr2_addr, producing mem]

Scheduling option when wr1_addr ≠ wr2_addr

Algo: rd_data := mem[rd_addr]; mem[wr_addr] := wr_data;

[Dataflow diagram: mem(rd) consumes mem and rd_addr and produces rd_data and mem; mem(wr) consumes that mem, wr_data, and wr_addr and produces mem]

Write after Read

Algo: rd_data := mem[rd_addr]; mem[wr_addr] := wr_data;

[Dataflow diagram: mem(rd) with inputs mem and rd_addr and output rd_data; mem(wr) with inputs mem, wr_data, and wr_addr and output mem]

Optimization when rd_addr ≠ wr_addr

2.7.5 Example: Memory Array and Dataflow Diagram

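1: M[2] := 21
2: M[3] := 31
3: A := M[2]
4: B := M[0]
5: M[3] := 32
6: M[0] := 01
7: C := M[3]

[Dataflow diagram: M(wr) nodes for statements 1, 2, 5, and 6 and M(rd) nodes for statements 3, 4, and 7, labelled 1 to 7; inputs are M and the written values and addresses; outputs are A, B, C, and M]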

Figure 2.6: Memory array example code and initial dataflow diagram

The dependency and anti-dependency arrows in the dataflow diagram in Figure 2.6 are based solely upon whether an operation is a read or a write. The arrows do not take into account the address that is read from or written to.

In Figure 2.7, we have used knowledge about which addresses we are accessing to remove unneeded dependencies. The remaining dependencies are the real ones and match those shown in the code fragment for Figure 2.6. In Figure 2.8, we have placed an ordering on the read operations and an ordering on the write operations. The ordering is derived by obeying data dependencies and then rearranging the operations to perform as many operations in parallel as possible.
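The idea of keeping only address-based dependencies and then packing operations as early as possible can be sketched as follows. This is an illustration under assumptions, not a reconstruction of the figures: `real_dependencies` and `asap_levels` are hypothetical names, every dependent operation is placed strictly after its predecessors, and no limit on memory ports is applied, so the result is an unconstrained as-soon-as-possible levelling rather than a final schedule.

```python
# (kind, address) for statements 1 to 7 of the example program.
PROGRAM = {
    1: ("wr", 2),  # M[2] := 21
    2: ("wr", 3),  # M[3] := 31
    3: ("rd", 2),  # A := M[2]
    4: ("rd", 0),  # B := M[0]
    5: ("wr", 3),  # M[3] := 32
    6: ("wr", 0),  # M[0] := 01
    7: ("rd", 3),  # C := M[3]
}


def real_dependencies(program):
    """Map each statement to the earlier statements it must follow.

    Only accesses to the same address count, and only when at least one
    of the two accesses is a write.
    """
    deps = {stmt: set() for stmt in program}
    for j, (kind_j, addr_j) in program.items():
        for i, (kind_i, addr_i) in program.items():
            if i < j and addr_i == addr_j and "wr" in (kind_i, kind_j):
                deps[j].add(i)
    return deps


def asap_levels(program):
    """Place each statement one level after its latest predecessor."""
    deps = real_dependencies(program)
    level = {}
    for stmt in sorted(program):
        level[stmt] = 1 + max((level[p] for p in deps[stmt]), default=0)
    return level


levels = asap_levels(PROGRAM)
print(levels)  # {1: 1, 2: 1, 3: 2, 4: 1, 5: 2, 6: 2, 7: 3}
```

Level 1 here contains two writes and a read, which is more than a single memory can serve in one clock cycle, so a real schedule still has to spread the operations according to the available ports.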


[Dataflow diagram: the operations of Figure 2.6 with only the address-based dependencies kept; outputs A, B, C, and M]

Figure 2.7: Memory array with minimal dependencies

[Dataflow diagram: as Figure 2.7, with ordering labels on the read operations and on the write operations]

Figure 2.8: Memory array with orderings

[Dataflow diagram: the operations of Figure 2.6 with ordering labels on the reads and writes]

Figure 2.9: Final version of Figure 2.6

Put as many parallel operations into the same clock cycle as allowed by resources (one write + one read, two reads, or one write for dual port RAM). Preserve dependencies by putting dependent operations in separate clock cycles.

2.8 Input / Output Protocols

An important aspect of hardware design is choosing an input/output protocol that is easy to implement and suits both your circuit and your environment. Here are a few simple and common protocols.

[Timing diagram: signals rdy, data, and ack]

Figure 2.10: Four phase handshaking protocol

Used when the timing of communication between producer and consumer is unpredictable. The disadvantage is that it is cumbersome to implement and slow to execute.
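The four phases can be traced with a short Python model. This is a sketch under assumptions: it uses the conventional arrangement in which the producer drives rdy and data and the consumer drives ack, and the function name `four_phase_transfer` is hypothetical. Each item costs four signal transitions, which illustrates why the protocol is slow.

```python
def four_phase_transfer(items):
    """Transfer each item with one full rdy/ack handshake.

    Returns the list of received items and a trace of (rdy, ack) values
    after each phase.
    """
    rdy, ack = 0, 0
    received, trace = [], []
    for item in items:
        data, rdy = item, 1        # phase 1: producer drives data, raises rdy
        trace.append((rdy, ack))
        received.append(data)      # phase 2: consumer captures data, raises ack
        ack = 1
        trace.append((rdy, ack))
        rdy = 0                    # phase 3: producer sees ack, lowers rdy
        trace.append((rdy, ack))
        ack = 0                    # phase 4: consumer sees rdy low, lowers ack
        trace.append((rdy, ack))
    return received, trace


received, trace = four_phase_transfer([5, 7])
print(received)
print(trace[:4])
```

Neither side needs to know how long the other takes to respond, since every step waits for the other side's signal to change.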

[Timing diagram: signals clk, data, and valid]

Figure 2.11: Valid-bit protocol

A low overhead (both in area and performance) protocol. Consumer must always be able to accept

incoming data. Often used in pipelined circuits. More complicated versions of the protocol can

handle pipeline stalls.
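A cycle-by-cycle view of the valid-bit protocol fits in a few lines of Python. This is a minimal sketch with hypothetical names (`consume`, `stream`): the producer presents one (valid, data) pair per clock cycle, and the consumer takes the data in exactly those cycles where valid is 1. There is no signal from consumer to producer, which is why the consumer must always be able to accept data.

```python
def consume(stream):
    """Collect the data from every clock cycle in which valid is 1.

    `stream` holds one (valid, data) pair per clock cycle.
    """
    return [data for valid, data in stream if valid == 1]


# Five clock cycles; data is a don't-care (None) when valid is 0.
stream = [(0, None), (1, 3), (1, 4), (0, None), (1, 5)]
accepted = consume(stream)
print(accepted)  # [3, 4, 5]
```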

[Timing diagram: signals clk, data_in, start, done, and data_out]

Figure 2.12: Start/Done protocol

A low overhead (both in area and performance) protocol. Useful when a circuit works on one piece

of data at a time and the time to compute the result is unpredictable.
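The start/done behaviour can be illustrated with a small Python model of a unit whose latency depends on its input. Everything here is hypothetical and chosen only for illustration: the `OnesCounter` unit counts the 1 bits of data_in by clearing one bit per clock cycle, so the number of cycles between start and done varies with the data. The environment pulses start once and then simply waits for done.

```python
class OnesCounter:
    """Hypothetical start/done unit: counts the 1 bits of data_in,
    clearing one bit per clock cycle."""

    def __init__(self):
        self.busy = False
        self.value = 0
        self.count = 0

    def clock(self, start=0, data_in=0):
        """Advance one clock cycle; return (done, data_out)."""
        if start and not self.busy:
            self.busy, self.value, self.count = True, data_in, 0
            return 0, None
        if self.busy:
            if self.value == 0:
                self.busy = False
                return 1, self.count      # done is high for one cycle
            self.value &= self.value - 1  # clear the lowest set bit
            self.count += 1
        return 0, None


def run(unit, data_in):
    """Pulse start, then wait for done; return (result, cycles waited)."""
    unit.clock(start=1, data_in=data_in)
    cycles = 0
    while True:
        cycles += 1
        done, data_out = unit.clock()
        if done:
            return data_out, cycles


print(run(OnesCounter(), 0b1011))  # (3, 4)
print(run(OnesCounter(), 0b1))     # (1, 2)
```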


2.9 Design Example: Massey

We'll go through the following artifacts:

1. requirements

2. algorithm

3. dataflow diagram

4. high-level models

5. hardware block diagram

6. RTL code for datapath

7. state machine

8. RTL code for control

Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1. Scheduling (allocate operations to clock cycles)

2. I/O allocation

3. First high-level model

4. Register allocation

5. Datapath allocation

6. Connect datapath components, insert muxes where needed

7. Design implicit state machine

8. Optimize

9. Design explicit-current state machine

10. Optimize

2.9.1 Requirements

Functional requirements:

Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f

Use registers on both inputs and outputs

Performance requirements:

Maximum clock period: unlimited

Maximum latency: four

Cost requirements:

Maximum of two adders

Small miscellaneous hardware (e.g. muxes) is unlimited

Maximum of three inputs and one output

Design effort is unlimited

Note: In reality, multiplexers are not free. In FPGAs, a 2:1 mux is more expensive than a full-adder. A 2:1 mux has three inputs while an adder has only two inputs (the carry-in and carry-out signals usually use the special vertical connections on the FPGA cell). In FPGAs, sharing an adder between two signals can be more expensive than having two adders. In a generic-gate technology, a multiplexer contains three two-input gates, while a full-adder contains fourteen two-input gates.

2.9.2 Algorithm

We'll use parentheses to group operations so as to maximize our opportunities to perform the work in parallel:

z = (a + b) + (c + d) + (e + f)

This results in the following data-dependency graph:

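[Data-dependency graph: inputs a, b, c, d, e, and f feed three first-level adders (a + b, c + d, e + f); two further adders combine the partial sums into z]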
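The benefit of the grouping can be checked by measuring the depth of the expression tree, that is, the longest chain of dependent additions. The sketch below is illustrative only: expressions are written as nested Python tuples, and `depth` and `adders` are hypothetical helper names. The fully serial grouping is included purely for comparison.

```python
def depth(expr):
    """Longest chain of dependent additions in a nested-tuple expression."""
    if isinstance(expr, str):
        return 0  # an input has no additions in front of it
    left, right = expr
    return 1 + max(depth(left), depth(right))


def adders(expr):
    """Total number of additions in the expression."""
    if isinstance(expr, str):
        return 0
    left, right = expr
    return 1 + adders(left) + adders(right)


# z = (a + b) + (c + d) + (e + f), with + associating to the left
grouped = ((("a", "b"), ("c", "d")), ("e", "f"))
# z = ((((a + b) + c) + d) + e) + f, shown for comparison
serial = ((((("a", "b"), "c"), "d"), "e"), "f")

print(depth(grouped), adders(grouped))  # 3 5
print(depth(serial), adders(serial))    # 5 5
```

Both groupings use five additions, but the grouped form needs only three dependent levels, which leaves more freedom when the additions are assigned to clock cycles.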


2.9.3 Initial Dataflow Diagram

[Dataflow diagram: inputs a, b, c, d, e, and f; five adders; output z]

This dataflow diagram violates the requirement to use at most three inputs.

2.9.4 Dataflow Diagram Scheduling

We can potentially optimize the inputs, outputs, area, and performance of a dataflow diagram by rescheduling the operations, that is, allocating the operations to different clock cycles.

Parallel algorithms have higher performance and greater scheduling flexibility than serial algorithms.

Scheduling to Optimize Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Dataflow diagrams, two panels: "Original parallel" and "Parallel after scheduling"]
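Whether a candidate schedule respects the cost requirements (at most three inputs and at most two adders in any clock cycle) can be checked mechanically. The sketch below is hedged throughout: `violations`, `wide`, and `narrow` are hypothetical names, both schedules are made-up examples rather than the ones drawn in the notes, and the check covers resource counts only, not data dependencies or latency.

```python
MAX_INPUTS = 3  # from the cost requirements
MAX_ADDERS = 2  # from the cost requirements


def violations(schedule):
    """List the clock cycles that exceed the input or adder limit.

    `schedule` holds one (inputs, additions) pair per clock cycle.
    """
    problems = []
    for cycle, (inputs, additions) in enumerate(schedule):
        if len(inputs) > MAX_INPUTS:
            problems.append(f"cycle {cycle}: {len(inputs)} inputs (max {MAX_INPUTS})")
        if len(additions) > MAX_ADDERS:
            problems.append(f"cycle {cycle}: {len(additions)} adders (max {MAX_ADDERS})")
    return problems


# Hypothetical schedule that reads four inputs in its first cycle.
wide = [
    (["a", "b", "c", "d"], ["a+b", "c+d"]),
    (["e", "f"], ["ab+cd", "e+f"]),
    ([], ["abcd+ef"]),
]
# Hypothetical rescheduling that reads three inputs per cycle.
narrow = [
    (["a", "b", "c"], ["a+b"]),
    (["d", "e", "f"], ["ab+c", "d+e"]),
    ([], ["abc+de"]),
    ([], ["abcde+f"]),
]

print(violations(wide))    # ['cycle 0: 4 inputs (max 3)']
print(violations(narrow))  # []
```

The second schedule trades an extra clock cycle for fewer inputs per cycle, which is the kind of exchange rescheduling makes possible.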
