very good notes-up2

Upload: abhay-sorte

Post on 05-Apr-2018


  • 8/2/2019 Very Good Notes-up2

    1/304

    E&CE 427: Digital Systems Engineering

    Course Notes

    Mark Aagaard

    2006t3Fall

    University of Waterloo

    Dept of Electrical and Computer Engineering

    September 18, 2006


    CONTENTS

    P10.7.3 Reality

    11 Problems on Faults, Testing, and Testability
    P11.1 Based on Smith q14.9: Testing Cost
    P11.2 Testing Cost and Total Cost
    P11.3 Minimum Number of Faults
    P11.4 Smith q14.10: Fault Collapsing
    P11.5 Mathematical Models and Reality
    P11.6 Undetectable Faults
    P11.7 Test Vector Generation
        P11.7.1 Choice of Test Vectors
        P11.7.2 Number of Test Vectors
    P11.8 Time to do a Scan Test
    P11.9 BIST
        P11.9.1 Characteristic Polynomials
        P11.9.2 Test Generation
        P11.9.3 Signature Analyzer
        P11.9.4 Probability of Catching a Fault
        P11.9.5 Probability of Catching a Fault
        P11.9.6 Detecting a Specific Fault
        P11.9.7 Time to Run Test
    P11.10 Power and BIST
    P11.11 Timing Hazards and Testability
    P11.12 Testing Short Answer
        P11.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?
        P11.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?
    P11.13 Fault Testing
        P11.13.1 Design test generator
        P11.13.2 Design signature analyzer
        P11.13.3 Determine if a fault is detectable
        P11.13.4 Testing time

    Part I

    Course Notes


    Chapter 1

    VHDL: The Language

    1.1 Introduction to VHDL

    1.1.1 Levels of Abstraction

    There are many different levels of abstraction for working with hardware:

    Quantum: Schrödinger's equations describe the movement of electrons and holes through material.

    Energy band: 2-dimensional diagrams that capture essential features of Schrödinger's equations. Energy-band diagrams are commonly used in nano-scale engineering.

    Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network. Overall behaviour is defined by differential equations in terms of the resistors and capacitors. Spice is a typical simulation tool.

    Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used, rather than differential equations. A rising edge may be modeled as a linear rise over some range of time, or the time between a definite low value and a definite high value may be modeled as having an undefined or rising value.

    Gate: Transistors are grouped together into gates (e.g. AND, OR, NOT). Voltages are discrete values such as pure Boolean (0 or 1) or IEEE Standard Logic 1164, which has representations for different types of unknown or undefined values. Time may be continuous or may be discrete. If discrete, a common unit is the delay through a single inverter (e.g. a NOT gate has a delay of 1 and an AND gate has a delay of 2).


    numeric_bit defines arithmetic over bit vectors and integers. We won't use bit signals in this course, so you don't need to worry about this package.

    1.1.3 Semantics

    The original goal of VHDL was to simulate circuits. The semantics of the language define circuit

    behaviour.

    [Figure: circuit with signals a, b, and c, and its simulation]


    determine which parts of the library are externally visible

    Use clause: use a library in an entity/architecture or another package

    technically, use clauses are part of entities and packages, but they precede the entity/package keyword, so we list them as top-level constructs

    Entity (section 1.3.3)

    define interface to circuit

    Architecture (section 1.3.3)

    define internal signals and gates of circuit

    1.3.3 Entities and Architecture

    Each hardware module is described with an Entity/Architecture pair

    Figure 1.1: Entity and Architecture

    Entity: interface. Names, modes (in / out), and types of the externally visible signals of the circuit.

    Architecture: internals. Structure and behaviour of the module.

    library ieee;
    use ieee.std_logic_1164.all;

    entity and_or is
      port (
        a, b, c : in std_logic;
        z : out std_logic
      );
    end and_or;

    Figure 1.2: Example of an entity


    The syntax of VHDL is defined using a variation on Backus-Naur forms (BNF).

    [ { use_clause } ]
    entity ENTITYID is
      [ port (
          { SIGNALID : (in | out) TYPEID [ := expr ] ; }
        );
      ]
      [ { declaration } ]
    [ begin
      { concurrent_statement } ]
    end [ entity ] ENTITYID ;

    Figure 1.3: Simplified grammar of entity

    architecture main of and_or is

    signal x : std_logic;

    begin

    x


    1.3.4 Concurrent Statements

    Architectures contain concurrent statements.

    Concurrent statements execute in parallel (Figure 1.6).

    Concurrent statements make VHDL fundamentally different from most software languages.

    Hardware (gates) naturally execute in parallel, and VHDL mimics the behaviour of real hardware.

    At each infinitesimally small moment of time, each gate:

    1. samples its inputs

    2. computes the value of its output

    3. drives the output

    architecture main of bowser is

    begin

    x1


    1.3.5 Component Declaration and Instantiations

    There are two different syntaxes for component declaration and instantiation. The VHDL-93 syntax is much more concise than the VHDL-87 syntax.

    Not all tools support the VHDL-93 syntax. For E&CE 427, some of the tools that we use do not support the VHDL-93 syntax, so we are stuck with the VHDL-87 syntax.

    1.3.6 Processes

    Processes are used to describe complex and potentially unsynthesizable behaviour

    A process is a concurrent statement (Section 1.3.4).

    The body of a process contains sequential statements (Section 1.3.7)

    Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)

    process (a, b, c)

    begin

    y


    1.3.8 A Few More Miscellaneous VHDL Features

    Some constructs that are useful and will be described in later chapters and sections:

    report : print a message on stderr while simulating

    assert : assertions about behaviour of signals, very useful with report statements.

    generics : parameters to an entity that are defined at elaboration time.

    attributes : predefined functions for different datatypes. For example: high and low indices of a

    vector.

    1.4 Concurrent vs Sequential Statements

    All concurrent assignments can be translated into sequential statements. But, not all sequential

    statements can be translated into concurrent statements.

    1.4.1 Concurrent Assignment vs Process

    The two code fragments below have identical behaviour:

    architecture main of tiny is

    begin

    b <= a;

    end main;

    architecture main of tiny is

    begin

    process (a) begin

    b

    t < = ;

    when =>

    t < = ;

    end case;

    1.4.4 Coding Style

    Code that's easy to write with sequential statements, but difficult with concurrent:

    Sequential Statements

    case is

    when =>

    if then

    o < = ;

    else

    o < = ;

    end if;

    when =>

    . . .

    end case;

    Concurrent Statements

    Overall structure:with select

    t


    1.5 Overview of Processes

    Processes are the most difficult VHDL construct to understand. This section gives an overview of

    processes. Section 1.6 gives the details of the semantics of processes.

    Within a process, statements are executed almost sequentially

    Among processes, execution is done in parallel

    Remember: a process is a concurrent statement!

    entity ENTITYID is
      interface declarations
    end ENTITYID ;

    architecture ARCHID of ENTITYID is
    begin
      concurrent statements
      process begin
        sequential statements
      end process;
      concurrent statements
    end ARCHID;

    Figure 1.11: Sequential statements in a process

    Key concepts in VHDL semantics for processes:

    VHDL mimics hardware

    Hardware (gates) execute in parallel

    Processes execute in parallel with each other

    All possible orders of executing processes must produce the same simulation results (waveforms)

    If a signal is not assigned a value, then it holds its previous value

    All orders of executing concurrent statements must produce the same waveforms

    It doesn't matter whether you are running on a single-threaded operating system, on a multithreaded operating system, on a massively parallel supercomputer, or on a special hardware emulator with one FPGA chip per VHDL process: all simulations must be the same.

    These concepts are the motivation for the semantics of executing processes in VHDL (Section 1.6) and lead to the phenomenon of latch-inference (Section 1.5.2).
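    The rule that an unassigned signal holds its previous value can be seen in a small sketch. The entity name hold_demo and its ports en, d, and q are hypothetical and chosen only for illustration. When en is '0' the process assigns nothing to q, so q keeps its old value, which is the behaviour of a latch.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity (not taken from the course notes).
entity hold_demo is
  port (
    en, d : in  std_logic;
    q     : out std_logic
  );
end hold_demo;

architecture main of hold_demo is
begin
  process (en, d)
  begin
    if en = '1' then
      q <= d;    -- q follows d while en is '1'
    end if;      -- no else branch: q is not assigned, so it holds its value
  end process;
end main;
```

    Adding an else branch that assigns q on every path would be one way to make this process purely combinational.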


    architecture

    procA: process

    stmtA1;

    stmtA2;

    stmtA3;

    end process;

    procB: process

    stmtB1;

    stmtB2;

    end process;

    Execution sequences (statements A1, A2, A3 of procA and B1, B2 of procB):

    single threaded: procA before procB

    single threaded: procB before procA

    multithreaded: procA and procB in parallel

    Figure 1.12: Different process execution sequences

    Figure 1.13: All execution orders must have same behaviour

    Sections 1.5.1 to 1.5.3 discuss the hardware generated by processes.

    Sections 1.6 to 1.6.5 discuss the behaviour and execution of processes.


    1.5.1 Combinational Process vs Clocked Process

    Each well-written synthesizable process is either combinational or clocked. Some synthesizable processes that do not conform to our coding guidelines are both combinational and clocked. For example, in a flip-flop with an asynchronous reset, the output is a combinational function of the reset signal and a clocked function of the data input signal. We will deal only with processes that follow our coding conventions, and so we will continue to say that each process is either combinational xor clocked.

    Combinational process:

    Executing the process takes part of one clock cycle

    Target signals are outputs of combinational circuitry

    A combinational process must have a sensitivity list

    A combinational process must not have any wait statements

    A combinational process must not have any rising_edges or falling_edges

    The hardware for a combinational process is just combinational circuitry

    Clocked process:

    Executing the process takes one (or more) clock cycles

    Target signals are outputs of flops

    Process contains one or more wait or if rising_edge statements

    Hardware contains combinational circuitry and flip flops

    Note: Clocked processes are sometimes called sequential processes, but this can be easily confused with sequential statements, so in E&CE 427 we'll refer to synthesizable processes as either combinational or clocked.
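    As a minimal sketch of the two kinds of process, the hypothetical entity comb_vs_clocked below computes the same AND function twice: once in a combinational process and once in a clocked process. The entity, port, and label names are assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity (not taken from the course notes).
entity comb_vs_clocked is
  port (
    clk, a, b : in  std_logic;
    y_comb    : out std_logic;
    y_reg     : out std_logic
  );
end comb_vs_clocked;

architecture main of comb_vs_clocked is
begin
  -- Combinational process: has a sensitivity list, no wait statements,
  -- and no rising_edge. The target y_comb is the output of a gate.
  comb : process (a, b)
  begin
    y_comb <= a and b;
  end process;

  -- Clocked process: the assignment is guarded by rising_edge, so the
  -- target y_reg is the output of a flop.
  reg : process (clk)
  begin
    if rising_edge(clk) then
      y_reg <= a and b;
    end if;
  end process;
end main;
```

    In simulation, y_comb changes as soon as a or b changes, while y_reg changes only at a rising edge of clk.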

    Example Processes

    Combinational Process

    process (a,b,c)

    p1


    1.6.2.4 Delta-Cycle Definitions

    Definition simulation step: Executing one sequential assignment or process mode change.

    Definition simulation cycle: The operations that occur in one iteration of the simulation algorithm.

    Definition delta cycle: A simulation cycle that does not advance simulation time. Equivalently: A simulation cycle with zero-delay assignments where the assignment causes a process to resume.

    Definition simulation round: A sequence of simulation cycles that all have the same simulation time. Equivalently: a contiguous sequence of zero or more delta cycles followed by a simulation cycle that increments time (i.e., the simulation cycle is not a delta cycle).

    Note: Official and unofficial terminology. Simulation cycle and delta cycle are official definitions in the VHDL Standard. Simulation step and simulation round are not standard definitions. They are used in E&CE 427 because we need words to associate with the concepts that they describe.


    1.6.3 Example 1: Process Execution (Bamboozle)

    This example (Bamboozle) and the next example (Flummox, section 1.6.4) are very similar. The

    VHDL code for the circuit is slightly different, but the hardware that is generated is the same. The

    stimulus for signals a and b also differs.

    entity bamboozle is
    begin
    end bamboozle;

    architecture main of bamboozle is
    signal a, b, c, d : std_logic;
    begin
    procA : process (a, b) begin
    c <= a AND b;
    end process;

    procB : process (b, c, d)

    begin

    d


    Initial conditions (Shown in slides, not in notes)

    Step 1(a): Activate procA(Shown in slides, not in notes)

    [Diagram: signals a, b, c, d, e, all with value U]

    procA: process (a, b) begin

    c


    [Diagram: signals a, b, c, d, e]

    procA: process (a, b) begin

    c


    Begin next simulation cycle (Shown in slides, not in notes)

    Step 1(a): Activate procB (Shown in slides, not in notes)

    Step 1(b): Provisional assignment to d (Shown in slides, not in notes)

    Step 1(b): Provisional assignment to e (Shown in slides, not in notes)

    Step 1(c): Suspend procB (Shown in slides, not in notes)

    All processes suspended (Shown in slides, not in notes)

    [Diagram: signals a, b, c, d, e]

    procA: process (a, b) begin

    c


    Begin next simulation cycle (Shown in slides, not in notes)

    Step 1: No postponed processes (Shown in slides, not in notes)

    [Diagram: signals a, b, c, d, e]

    procA: process (a, b) begin

    c


    1.6.4 Example 2: Process Execution (Flummox)

    This example is a variation of the Bamboozle example from section 1.6.3.

    entity flummox is
    begin
    end flummox;

    architecture main of flummox is
    signal a, b, c, d : std_logic;
    begin
    proc1 : process (a, b, c) begin
    c <= a AND b;
    d


    Answer:

    simulation step, delta cycle, simulation cycle, simulation round

    Question: What is the order of granularity, from finest to coarsest, amongst the

    different granularities related to delta-cycle simulation?

    Answer:

    Same order as listed just above. Note: delta cycles have a finer granularity than simulation cycles, because delta cycles do not advance time, while simulation cycles that are not delta cycles do advance time.

    1.6.5 Example: Need for Provisional Assignments

    This is an example of processes where updating signals during a simulation cycle leads to different

    results for different process execution orderings.

    architecture main of swindle is
    begin
      p_c: process (a, b) begin
        c <= a AND b;
      end process;

      p_d: process (a, c) begin
        d <= a XOR c;
      end process;
    end main;

    Figure 1.18: Circuit to illustrate need for provisional assignments

    1. Start with all signals at 0.

    2. Simultaneously change to a = 1 and b = 1.

    If assignments are not visible within the same simulation cycle (correct: i.e. provisional assignments are used):

    If p_c is scheduled before p_d, then d will have a 1 pulse.

    If p_d is scheduled before p_c, then d will have a 1 pulse.

    If assignments are visible within the same simulation cycle (incorrect):

    If p_c is scheduled before p_d, then d will stay constant 0.

    If p_d is scheduled before p_c, then d will have a 1 pulse.

    [Waveform diagrams for the four cases: signals a, b, c, d starting at 0, and the process modes of p_c and p_d]

    With provisional assignments, both orders of scheduling processes result in the same behaviour

    on all signals. Without provisional assignments, different scheduling orders result in different

    behaviour.
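    The experiment above can be tried in a simulator with a small testbench. This is a sketch under assumptions: the entity name swindle_tb, the stimulus process stim, the monitor process watch, and the 10 ns stimulus time are not from the notes. The processes p_c and p_d are copied from the swindle architecture.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench (not taken from the course notes).
entity swindle_tb is
end swindle_tb;

architecture main of swindle_tb is
  -- Step 1: start with all signals at 0.
  signal a, b, c, d : std_logic := '0';
begin
  p_c : process (a, b) begin
    c <= a and b;
  end process;

  p_d : process (a, c) begin
    d <= a xor c;
  end process;

  -- Step 2: simultaneously change to a = 1 and b = 1.
  stim : process begin
    wait for 10 ns;
    a <= '1';
    b <= '1';
    wait;
  end process;

  -- Report the value of d at start-up and after every event on d.
  watch : process (d) begin
    report "d = " & std_logic'image(d);
  end process;
end main;
```

    Because VHDL signal assignments are provisional, a standard-conforming simulator should report d rising to 1 and falling back to 0 at 10 ns, in successive delta cycles, whichever order it uses to run p_c and p_d.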


    1.6.6 Delta-Cycle Simulations of Flip-Flops

    This example illustrates the delta-cycle simulation of a flip-flop. Notice how the delta-cycle simulation captures the expected behaviour of the flip-flop: the signal q changes at the same time (10 ns) as the rising edge on the clock.

    p_a : process begin

    a

    Testbenches and Clock Phases

    env : process begin

    a

    RTL Simulation Technique

    1. Pre-processing

    (a) Separate processes into combinational and non-combinational (clocked and timed)

    (b) Decompose each combinational process into separate processes with one target signal

    per process

    (c) Sort processes into topological order based on dependencies

    2. For each clock cycle or unit of time:

    (a) Run non-combinational processes in any order. Non-combinational assignments read

    from earlier clock cycle / time step.

    (b) Run combinational processes in topological order. Combinational assignments read

    from current clock cycle / time step.

    1.7.2 Examples of RTL Simulation

    Combinational Process Decomposition

    proc(a,b,c)

    if a = '1' then

    d <= b;

    e <= c;

    else

    d


    8. Run the timed process until it suspends at wait for 99 ns;, which takes us from 3ns to 102ns.

    9. Run combinational processes in topological order to calculate values on c, d, e from 3ns to

    102ns.

    Question: Draw the RTL waveforms that correspond to the delta-cycle waveform

    below.

    [Delta-cycle waveform diagram: signals a, b, c, d, e; process modes for proc1, proc2, proc3; rows marking the beginning (B) and end (E) of each delta cycle, simulation cycle, and simulation round; simulation times from 0ns to 102ns]

    Answer:

    [RTL waveform diagram: signals a, b, c, d, e at times 0ns, 1ns, 2ns, 3ns, and 102ns]

    Example: Communicating State Machines

    Note: It is easier to do a simulation by hand if you start your clock at 0 and use the first clock phase in the waveform diagram for the first values that your VHDL code assigns to signals.

    Simulate If-Then-Else, Wait Until

    huey: process

    begin

    clk

    A Related Simulation

    Small changes to the code can cause significant changes to the behaviour.

    riri: process

    begin

    clk


    1.8.3.3 Flops with Chip-Enable

    The two code fragments below synthesize to identical hardware (flops with chip-enable lines).

    If

    process (clk)

    begin

    if rising_edge(clk) then

    if (ce = '1') then

    q
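    For reference, a complete flop with chip enable in the if-statement style might be written as follows. This is a sketch: the entity name ce_flop and the data input d are assumptions, and only the signals clk, ce, and q appear in the fragment above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity (not taken from the course notes).
entity ce_flop is
  port (
    clk, ce, d : in  std_logic;
    q          : out std_logic
  );
end ce_flop;

architecture main of ce_flop is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if (ce = '1') then
        q <= d;    -- load new data only when chip enable is high
      end if;      -- otherwise q is not assigned and the flop keeps its value
    end if;
  end process;
end main;
```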


    (a) Flops use if statements

    (b) Flops use wait statements

    Some examples of these different options are shown in Figures 1.21 to 1.24.

    entity and_not_reg is
      port (
        reset,
        clk,
        sel : in std_logic;
        c : out std_logic
      );
    end;

    Figure 1.20: Schematic and entity for and_not_reg, used by the examples of different code organizations in Figures 1.21 to 1.24

    One Process, Flops, Wait

    architecture one_proc of and_not_reg is

    signal a : std_logic;

    begin

    process begin

    wait until rising_edge(clk);

    if (reset = '1') then

    a

    Two Processes with If-Then-Else

    architecture two_proc_if of and_not_reg is

    signal a : std_logic;

    begin

    process (clk)

    begin

    if rising_edge(clk) then

    if (reset = '1') then

    a


    1.10.4 Different Widths and Arithmetic

    Table 1.2: Different Vector Widths and Arithmetic Operations (+, -)

    target   src1/2   src2/1   result
    narrow   wide              fails in elaboration
    wide     narrow   int      fails in elaboration
    wide     wide              OK
    narrow   narrow   narrow   OK
    narrow   narrow   int      OK

    Example vectors:
    wide     unsigned(7 downto 0)
    narrow   unsigned(4 downto 0)
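    The OK rows of the table can be sketched as follows, assuming the unsigned type and its "+" and resize operations come from ieee.numeric_std. The entity width_demo and its port names are hypothetical. The last assignment shows one possible way to handle the "wide, narrow, int" row: widen the narrow operand with resize before adding.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity (not taken from the course notes).
entity width_demo is
  port (
    wide_in    : in  unsigned(7 downto 0);
    narrow_in  : in  unsigned(4 downto 0);
    wide_sum   : out unsigned(7 downto 0);
    narrow_inc : out unsigned(4 downto 0);
    wide_inc   : out unsigned(7 downto 0)
  );
end width_demo;

architecture main of width_demo is
begin
  -- wide target with a wide source: the result is 8 bits wide
  wide_sum   <= wide_in + narrow_in;

  -- narrow target, narrow source and integer: the result is 5 bits wide
  narrow_inc <= narrow_in + 1;

  -- wide target from a narrow source and an integer:
  -- resize the narrow operand to 8 bits first so the widths match
  wide_inc   <= resize(narrow_in, 8) + 1;
end main;
```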

    1.10.5 Overloading of Comparisons

    Table 1.3: Overloading of Comparison Operations (=, /=, >=, >, =, >,

    1.11.1.4 Multiple if rising_edges in Same Process

    Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)

    process (clk)

    begin

    if rising_edge(clk) then

    q0

    Synthesizable Alternative

    A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge outside of the for-loop.

    process (clk) begin

    if rising_edge(clk) then

    for i in 0 to 7 loop

    q(i)
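    A complete version of this pattern might look like the sketch below. The entity name reg8_loop, the input vector d, and the choice to copy d(i) into q(i) are assumptions; the fragment above does not show what is assigned to q(i).

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity (not taken from the course notes).
entity reg8_loop is
  port (
    clk : in  std_logic;
    d   : in  std_logic_vector(7 downto 0);
    q   : out std_logic_vector(7 downto 0)
  );
end reg8_loop;

architecture main of reg8_loop is
begin
  process (clk) begin
    -- A single rising_edge test encloses the whole loop
    if rising_edge(clk) then
      for i in 0 to 7 loop
        q(i) <= d(i);    -- one flop per loop iteration
      end loop;
    end if;
  end process;
end main;
```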


    1.13 VHDL Problems

    P1.1 IEEE 1164

    For each of the values in the list below, answer whether or not it is defined in the ieee.std_logic_1164 library. If it is part of the library, write a 2-3 word description of the value.

    Values: -, #, 0, 1, A, h, H, L, Q, X, Z.

    P1.2 VHDL Syntax

    Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.

    NOTES: 1) ... represents a fragment of legal VHDL code.

    2) For full marks, if the code is illegal, you must explain why.

    3) The code has been written so that, if it is illegal, then it is illegal for both

    simulation and synthesis.

    q2a

    architecture main of anchiceratops is

    signal a, b, c : std_logic;

    begin

    process begin

    wait until rising_edge(c);

    a p, b => q);

    ...

    end main;

    q2e

    architecture main of pachyderm is

    function inv(a : std_logic)

    return std_logic is

    begin

    return(NOT a);

    end inv;

    signal p, b : std_logic;

    begin

    p a);

    ...

    end main;

    q2f

    architecture main of apatosaurus is

    type state_ty is (S0, S1, S2);

    signal st : state_ty;

    signal p : std_logic;

    begin

    case st is

    when S0 | S1 => p p


    P1.3 Flops, Latches, and Combinational Circuitry

    For each of the signals p...z in the architecture main of montevido, answer whether the signal is a latch, combinational gate, or flip-flop.

    entity montevido is

    port (

    a, b0, b1, c0, c1, d0, d1, e0, e1 : in std_logic;

    l : in std_logic_vector (1 downto 0);

    p, q, r, s, t, u, v, w, x, y, z : out std_logic

    );

    end montevido;

    architecture main of montevido is

    signal i, j : std_logic;

    begin

    i


    entity bigckt is

    port (

    a, b : in std_logic;

    c : out std_logic

    );

    end bigckt;

    architecture main of bigckt is

    begin

    process (a, b)

    begin

    if (a = '0') then

    c


    P1.6 Delta-Cycle Simulation: Pong

    Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.

    INSTRUCTIONS:

    1. The simulation is to be done at the granularity of simulation-steps.

    2. Show all changes to process modes and signal values.

    3. Each column of the timing diagram corresponds to a simulation step that changes a signal or

    process.

    4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round.

    5. End your simulation just before 20 ns.

    architecture main of pong_machine is

    signal ping_i, ping_n, pong_i, pong_n : std_logic;

    begin

    reset_proc: process

    reset


    P1.8 Clock-Cycle Simulation

    Given the VHDL code for anapurna and the waveform diagram below, answer what the values of the signals y, z, and p will be at the given times.

    entity anapurna is

    port (

    clk, reset, sel : in std_logic;

    a, b : in unsigned(15 downto 0);

    p : out unsigned(15 downto 0)

    );

    end anapurna;

    architecture main of anapurna is

    type state_ty is (mango, guava, durian, papaya);

    signal y, z : unsigned(15 downto 0);

    signal state : state_ty;

    begin

    proc_herzog: process

    begin

    top_loop: loop

    wait until (rising_edge(clk));

    next top_loop when (reset = '1');

    state

    P1.10 VHDL vs VHDL Behavioural Comparison: Ichthyostega

    For each of the VHDL architectures q4a through q4c, does the signal v have the same behaviour as it does in the main architecture of ichthyostega?

    NOTES: 1) For full marks, if the code has different behaviour, you must explain

    why.

    2) Ignore any differences in behaviour in the first few clock cycles that are caused by initialization of flip-flops, latches, and registers.

    3) All code fragments in this question are legal, synthesizable VHDL code.

    entity ichthyostega is

    port (

    clk : in std_logic;

    b, c : in signed(3 downto 0);

    v : out signed(3 downto 0)

    );

    end ichthyostega;

    architecture main of ichthyostega is

    signal bx, cx : signed(3 downto 0);

    begin

    process begin

    wait until (rising_edge(clk));bx


    P1.11 Waveform-VHDL Behavioural Comparison

    Answer whether each of the VHDL code fragments q3a through q3d has the same behaviour as the timing diagram.

    NOTES: 1) Same behaviour means that the signals a, b, and c have the same values at

    the end of each clock cycle in steady-state simulation (ignore any irregularities

    in the first few clock cycles).

    2) For full marks, if the code does not match, you must explain why.

    3) Assume that all signals, constants, variables, types, etc are properly defined

    and declared.

    4) All of the code fragments are legal, synthesizable VHDL code.

    [Timing diagram: waveforms for clk, a, b, and c]

    q3a

    architecture q3a of q3 is

    begin

    process begin

    a


    P1.12 Hardware-VHDL Comparison

    For each of the circuits q2a-q2d, answer whether the signal d has the same behaviour as it does in the main architecture of q2.

    entity q2 is

    port (

    a, clk, reset : in std_logic;

    d : out std_logic

    );

    end q2;

    architecture main of q2 is

    signal b, c : std_logic;

    begin

    b <= '0' when (reset = '1')

    else a;

    process (clk) begin

    if rising_edge(clk) then

    c <= b;

    d <= c;

    end if;

    end process;

    end main;

    [Circuit diagrams q2a, q2b, q2c, and q2d: each circuit has inputs clk, a, reset, and the constant 0, and output d]

    P1.13 8-Bit Register

    Implement an 8-bit register that has:

    - clock signal clk
    - input data vector d
    - output data vector q
    - synchronous active-high input reset
    - synchronous active-high input enable

    P1.13.1 Asynchronous Reset

    Modify your design so that the reset signal is asynchronous, rather than synchronous.

    P1.13.2 Discussion

    Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on an FPGA.

    P1.13.3 Testbench for Register

    Write a test bench to validate the functionality of the 8-bit register with synchronous reset.


    P1.14 Synthesizable VHDL and Hardware

    For each of the fragments of VHDL q4a...q4f, answer whether the code is synthesizable. If the code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of the code. If the code is not synthesizable, explain why.

    q4a

    process begin

    wait until rising_edge(a);

    e <= d;

    wait until rising_edge(b);

    e


    P1.15 Datapath Design

    Each of the three VHDL fragments q4a-q4c is intended to be the datapath for the same circuit. The circuit is intended to perform the following sequence of operations (not all operations are required to use a clock cycle):

    - read in source and destination addresses from i_src1, i_src2, i_dst
    - read operands op1 and op2 from memory
    - compute sum of operands sum
    - write sum to memory at destination address dst
    - write sum to output o_result

    [Block diagram: circuit with inputs i_src1, i_src2, i_dst, and clk, and output o_result]

    P1.15.1 Correct Implementation?

    For each of the three fragments of VHDL q4a-q4c, answer whether it is a correct implementation of the datapath. If the datapath is not correct, explain why. If the datapath is correct, answer in which cycle you need load='1'.

    NOTES:

    1. You may choose the number of clock cycles required to execute the sequence of operations.

    2. The cycle in which the addresses are on i_src1, i_src2, and i_dst is cycle #0.

    3. The control circuitry that controls the datapath will output a signal load, which will be '1' when the sum is to be written into memory.

    4. The code fragment with the signal declarations, connections for inputs and outputs, and the instantiation of memory is to be used for all three code fragments q4a-q4c.

    5. The memory has registered inputs and combinational (unregistered) outputs.

    6. All of the VHDL is legal, synthesizable code.

    -- This code is to be used for

    -- all three code fragments q4a--q4c.

    signal state : std_logic_vector(3 downto 0);

    signal src1, src2, dst, op1, op2, sum,
           mem_in_a, mem_out_a, mem_out_b,
           mem_addr_a, mem_addr_b
           : unsigned(7 downto 0);

    ...

    process (clk)

    begin

    if rising_edge(clk) then

    src1 mem_we,

    i_data_a => mem_in_a,

    o_data_a => mem_out_a,

    o_data_b => mem_out_b);


    q4a

    op1 0);op2 0);

    sum 0);

    mem_in_a 0);

    mem_addr_a


    Chapter 2

    RTL Design with VHDL: From

    Requirements to Optimized Code

    2.1 Prelude to Chapter

    2.1.1 A Note on EDA for FPGAs and ASICs

    The following is from John Cooley's column "The Industry Gadfly" from 2003/04/30. The title of this article is: "The FPGA EDA Slums".

    For 2001, Dataquest reported that the ASIC market was US$16.6 billion while the FPGA market was US$2.6 billion.

    What's more interesting is that the 2001 ASIC EDA market was US$2.2 billion while the FPGA EDA market was US$91.1 million. Nope, that's not a mistake. It's ASIC EDA and billion versus FPGA EDA and million. Do the math and you'll see that for every dollar spent on an ASIC project, roughly 12 cents of it goes to an EDA vendor. For every dollar spent on a FPGA project, roughly 3.4 cents goes to an EDA vendor. Not good.

    It's the old "free milk and a cow" story according to Gary Smith, the Senior EDA Analyst at Dataquest. "Altera and Xilinx have fowled their own nest. Their free tools spoil the FPGA EDA market," says Gary. "EDA vendors know that there's no money to be made in FPGA tools."


    2.2 FPGA Background and Coding Guidelines

    2.2.1 Generic FPGA Hardware

    2.2.1.1 Generic FPGA Cell

    Cell = Logic Element (LE) in Altera

    = Configurable Logic Block (CLB) in Xilinx

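    [Figure: generic FPGA cell containing a combinational block (comb) and a flip-flop (D, Q, CE, S, R), with ports comb_data_in, ctrl_in, carry_in, carry_out, flop_data_in, comb_data_out, and flop_data_out]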

    2.2.2 Area Estimation

    We estimate the number of FPGA cells required for a design by counting the number of flip-flops and primary inputs that are in the fanin of each flip-flop. Only flip-flops and primary inputs count, because combinational signals are collapsed into the circuitry within an FPGA cell. The circuitry for any flip-flop signal with up to four source flip-flops can be implemented on a single FPGA cell. If a flip-flop signal is dependent upon five source flip-flops, then two FPGA cells are required.

    Source flops/inputs    Minimum cells
    -------------------    -------------
             1                   1
             2                   1
             3                   1
             4                   1
             5                   2
             6                   2
             7                   2
             8                   3
             9                   3
            10                   3
            11                   4

    For a single target signal, this technique gives a lower bound on the number of cells needed. For

    example, some functions of seven inputs require more than two cells. As a particular example, a

    four-to-one multiplexer has six inputs and requires three cells.

    When dealing with multiple target signals, this technique might be an overestimate, because a

    single cell can drive several other cells (common subexpression elimination).
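The table of minimum cells above follows a simple pattern, which can be written as a closed form. This is a reading of the table rather than a formula stated in the notes. It assumes 4-input cells that are chained, so that every cell after the first spends one of its inputs on the previous cell's output and therefore absorbs only three new sources. Here $n$ is the number of source flip-flops and primary inputs.

```latex
\[
  N_{\text{cells}}(n) \;\ge\; \max\!\left(1,\ \left\lceil \frac{n-1}{3} \right\rceil\right)
\]
% Checks against the table:
%   n = 4  : ceil(3/3)  = 1
%   n = 5  : ceil(4/3)  = 2
%   n = 8  : ceil(7/3)  = 3
%   n = 11 : ceil(10/3) = 4
```

As with the table, this is only a lower bound: for the four-to-one multiplexer ($n = 6$) the formula gives 2, while the notes state that three cells are needed.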

    PLA and Flop for Different Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

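    [Figure: generic FPGA cell (comb block and flip-flop with D, Q, CE, S, R; ports comb_data_in, ctrl_in, carry_in, carry_out, flop_data_in, comb_data_out, flop_data_out) with the combinational block and the flip-flop used for different functions]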

    PLA and Flop for Same Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    [Figure: generic FPGA cell (comb block and flip-flop with D, Q, CE, S, R; ports comb_data_in, ctrl_in, carry_in, carry_out, flop_data_in, comb_data_out, flop_data_out) with the combinational block and the flip-flop used for the same function]

    Estimate Area for Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    Question: Map the combinational circuits below onto generic FPGA cells.

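    [Figures: three example circuits and their mappings onto generic FPGA cells. The first has inputs a-d and output z and maps onto one cell. The second has inputs a-i and signals x, y, and z and maps onto two cells. The third has inputs a-i and signals w, x, y, and z and maps onto three cells.]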


    2.2.2.1 Interconnect for Generic FPGA

    Note: In these slides, the space between tightly grouped wires sometimes

    disappears, making a group of wires appear to be a single large wire.

    There are two types of wires that connect a cell to the rest of the chip:

    - General purpose interconnect (configurable, slow)
    - Carry chains and cascade chains (vertically adjacent cells, fast)

    2.2.2.2 Blocks of Cells for Generic FPGA

    Cells are organized into blocks. There is a great deal of interconnect (wires) between cells within a single block. In large FPGAs, the blocks are organized into larger blocks. These large blocks might themselves be organized into even larger blocks. Think of an FPGA as a bunch of nested for-generate statements that replicate a single component (cell) hundreds of thousands of times.

    Cells not used for computation can be used as wires to shorten the length of the path between cells.


    2.2.4 Altera APEX20K Information and Coding Guidelines

    APEX20K Block Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    Chip
      52 Mega Logic Array Blocks (MegaLABs)
        1 Embedded System Block (ESB)
          Memory and wide combinational functions
        16 Logic Array Blocks (LABs)
          10 Logic Elements (LEs)
            4-input lookup table
            Carry and cascade
            Flip-flop

    Each level of hierarchy has its own interconnect (wires).

    LE Computation and Storage . . . . . . . . .

    4-input lookup table (LUT)

    Carry-chain computation circuitry

    Cascade-chain computation circuitry

    Flip-flop with load, clear, clock-enable

    LE Interconnect . . . . . . . . . . . . . . . . . . . . . .

    4 data inputs

    2 data outputs

    Carry in, carry out

    Cascade in, cascade out

    Clock, clock-enable

    Async clear, synch set (load), synch clear (reset)

    Global reset

    Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    The Altera APEX20K chips initialize all flip flops to 0 at startup. To mimic this behaviour in

    simulation, you should put an initial value of 0 on all flip flops. If you are doing your own

    encoding for a state machine, choose the reset state to be encoded as all zeroes.

    You should not put initial values on inputs or combinational signals.
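The initialization guideline can be sketched in VHDL as follows. The entity `init_demo`, its ports, and the two-state encoding are hypothetical names invented for this illustration; they do not come from the notes. Only signals that synthesize to flip-flops carry an initial value of zero, and the hand-encoded reset state is all zeroes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity (not from the notes).
entity init_demo is
  port (
    clk : in  std_logic;
    d   : in  std_logic;   -- primary input: no initial value
    q   : out std_logic
  );
end init_demo;

architecture main of init_demo is
  -- Hand-chosen state encoding: the reset state is all zeroes.
  constant st_idle : std_logic_vector(1 downto 0) := "00";
  constant st_busy : std_logic_vector(1 downto 0) := "01";

  -- Flip-flop signals: initial value of '0' mimics power-up on the chip.
  signal state : std_logic_vector(1 downto 0) := st_idle;
  signal q_reg : std_logic := '0';

  -- Combinational signal: deliberately no initial value.
  signal d_inv : std_logic;
begin
  d_inv <= not d;

  process (clk)
  begin
    if rising_edge(clk) then
      if state = st_idle then
        state <= st_busy;
      else
        state <= st_idle;
        q_reg <= d_inv;   -- capture the input only when leaving st_busy
      end if;
    end if;
  end process;

  q <= q_reg;
end main;
```

With these initial values, simulation starts in the same all-zero condition that the notes describe for the APEX20K flip-flops at startup.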

    2.3 Design Flow

    2.3.1 Generic Design Flow

    Most people agree on the general terminology and process for a digital hardware design flow.

    However, each book and course has its own particular way of presenting the ideas. Here we will

    lay out the consistent set of definitions that we will use in E&CE 427. This might be different from

    what you have seen in other courses or on a work term. Focus on the ideas and you will be fine

    both now and in the future.

    The design flow presented here focuses on the artifacts that we work with, rather than the operations that are performed on the artifacts. This is because the same operations can be performed at different points in the design flow, while the artifacts each have a unique purpose.

    [Figure: design-flow artifacts Requirements, Algorithm, High-Level Model, DP+Ctrl Code, Opt. RTL Code, Implementation, and Hardware, each with an Analyze/Modify loop; one step is annotated "dp/ctrl specific"]

    Figure 2.1: Generic Design Flow

    Storage

    Purpose: hold data for future use

    Data is not modified while stored

    Examples: register files, FIFO queues

    Control

    Purpose: modify internal state based on inputs, compute outputs from state and inputs

    Mostly individual signals, few data (vectors)

    Examples: bus arbiters, memory-controllers

    All three classes of circuits (datapath, control, and storage) follow the same generic design flow (Figure 2.1) and use dataflow diagrams, hardware block diagrams, and state machines. The differences in the design flows appear in the relative amount of effort spent on each type of description and the order in which the different descriptions are used. The differences are most pronounced in the transition from the high-level model to the model that separates the datapath and control circuitry.

    2.3.3.2 Datapath-Centric Design Flow

    [Figure: High-Level Model, Dataflow, Block Diagram, State Machine, and DP+Ctrl RTL Code, with Analyze/Modify loops]

    Figure 2.2: Datapath-Centric Design Flow

    2.3.3.3 Control-Centric Design Flow

    [Figure: High-Level Model, State Machine, Dataflow Diagram, Block Diagram, and DP+Ctrl RTL Code, with Analyze/Modify loops]

    Figure 2.3: Control-Centric Design Flow

    2.3.3.4 Storage-Centric Design Flow

    In E&CE 427, we won't be discussing storage-centric design. Storage-centric design differs from datapath- and control-centric design in that storage-centric design focusses on building many replicated copies of small cells.

    Storage-centric designs include a wide range of circuits, from simple memory arrays to complicated circuits such as register files, translation lookaside buffers, and caches. The complicated circuits can contain large and very intricate state machines, which would benefit from some of the techniques for control-centric circuits.

    2.4 Algorithms and High-Level Models

    For designs with significant control flow, algorithms can be described in software languages, flowcharts, abstract state machines, algorithmic state machines, etc.

    For designs with trivial control flow (e.g. every parcel of input data undergoes the same computation), data-dependency graphs (section 2.4.2) are a good way to describe the algorithm.

    For designs with a small amount of control flow (e.g. a microprocessor, where a single decision is made based upon the opcode) a set of data-dependency graphs is often a good choice.

    Software executes in series; hardware executes in parallel.

    When creating an algorithmic description of your hardware design, think about how you can represent parallelism in the algorithmic notation that you are using, and how you can exploit parallelism to improve the performance of your design.

    2.4.1 Flow Charts and State Machines

    Flow charts and various flavours of state machines are covered well in many courses. Generally, everything that you've learned about these forms of description is also applicable in hardware design.

    In addition, you can exploit parallelism in state machine design to create communicating finite state machines. A single complex state machine can be factored into multiple simple state machines that operate in parallel and communicate with each other.

    2.4.2 Data-Dependency Graphs

    In software, the expression (((((a + b) + c) + d) + e) + f) takes the same amount of time to execute as (a + b) + (c + d) + (e + f).

    But, remember: hardware runs in parallel. In algorithmic descriptions, parentheses can guide parallel vs serial execution.

    Data-dependency graphs capture algorithms of datapath-centric designs.

    Datapath-centric designs have few, if any, control decisions: every parcel of input data undergoes the same computation.

    Serial: (((((a+b)+c)+d)+e)+f)          Parallel: (a+b)+(c+d)+(e+f)

    [Figure: data-dependency graphs for the serial and parallel forms, each with inputs a, b, c, d, e, f and five adders]

    5 adders on longest path (slower)      3 adders on longest path (faster)

    5 adders used (equal area)             5 adders used (equal area)
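The serial and parallel forms compared above can be written as two combinational assignments. This is a minimal sketch: the entity `sum6`, the port names `z_serial` and `z_tree`, and the 8-bit width are assumptions made for illustration. A synthesis tool is free to restructure the additions, so the parentheses should be read as a hint rather than a guarantee.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity (not from the notes).
entity sum6 is
  port (
    a, b, c, d, e, f : in  unsigned(7 downto 0);
    z_serial, z_tree : out unsigned(7 downto 0)
  );
end sum6;

architecture main of sum6 is
begin
  -- Serial form: a chain with five adders on the longest path.
  z_serial <= (((((a + b) + c) + d) + e) + f);

  -- Parallel form: (a+b), (c+d), and (e+f) can be computed at the same
  -- time, leaving three adders on the longest path.
  z_tree <= ((a + b) + (c + d)) + (e + f);
end main;
```

Both outputs carry the same value (addition on `unsigned` wraps modulo 2^8 in either form); only the depth of the adder structure differs.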

    2.4.3 High-Level Models

    There are many different types of high-level models, depending upon the purpose of the model and the characteristics of the design that the model describes. Some models may capture power consumption, others performance, others data functionality.

    High-level models are used to estimate the most important design metrics very early in the design cycle. If power consumption is more important than performance, then you might write high-level models that can predict the power consumption of different design choices, but which have no information about the number of clock cycles that a computation takes, or which predict the latency inaccurately. Conversely, if performance is important, you might write clock-cycle accurate high-level models that do not contain any information about power consumption.

    Conventionally, performance has been the primary design metric. Hence, high-level models that predict performance are more prevalent and better understood than other types of high-level models. There are many research and entrepreneurial opportunities for people who can develop tools and/or languages for high-level models for estimating power, area, maximum clock speed, etc.

    In E&CE 427 we will limit ourselves to the well-understood area of high-level models for performance prediction.

    As with all topics in E&CE 427, there are tradeoffs between these different styles of writing state machines. Most books teach only the explicit-current+next style. This style is the closest to the hardware, which means that it is more amenable to optimization through human intervention, rather than relying on a synthesis tool for optimization. The advantage of the implicit style is that it is concise and readable for control flows consisting of nested loops and branches (e.g. the type of control flow that appears in software). For control flows that have less structure, it can be difficult to write an implicit state machine. Very few books or synthesis manuals describe multiple-wait statement processes, but they are relatively well supported among synthesis tools.

    Because implicit state machines are written with loops, if-then-elses, cases, etc., it is difficult to write some state machines with complicated control flows in an implicit style. The following example illustrates the point.

    [State diagram: Moore machine with states s0/0, s1/1, s2/0, and s3/0, and transitions labelled a, !a, !a, and a]

    Note: The terminology of "explicit" and "implicit" is somewhat standard, in that some descriptions of processes with multiple wait statements describe the processes as having "implicit state machines".

    There is no standard terminology to distinguish between the two explicit styles: explicit-current+next and explicit-current.
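To make the explicit-current+next style concrete, here is a minimal sketch of a four-state Moore machine for the `simple` entity used in these notes. The architecture name `moore_sketch` is invented, and the transitions (s0 to s1 when a is '1', s0 to s2 otherwise, s1 and s2 to s3, s3 back to s0) are an assumption about the state diagram, which did not survive in this copy of the notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity simple is
  port (
    a, clk : in  std_logic;
    z      : out std_logic
  );
end simple;

-- Sketch only: the transition structure below is assumed.
architecture moore_sketch of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state     : state_ty := s0;  -- flip-flops: initial value allowed
  signal state_nxt : state_ty;        -- combinational: no initial value
begin
  -- Current-state register.
  process (clk)
  begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;

  -- Next-state logic (combinational).
  process (state, a)
  begin
    case state is
      when s0 =>
        if a = '1' then
          state_nxt <= s1;
        else
          state_nxt <= s2;
        end if;
      when s1 | s2 => state_nxt <= s3;
      when s3      => state_nxt <= s0;
    end case;
  end process;

  -- Moore output: a function of the current state only.
  z <= '1' when state = s1 else '0';
end moore_sketch;
```

Merging the two processes into one clocked process that assigns `state` directly would give the explicit-current style.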

    2.5.2 Implementing a Simple Moore Machine

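    [State diagram: Moore machine with states s0/0, s1/1, s2/0, and s3/0, and transitions labelled a and !a]

    entity simple is
      port (
        a, clk : in std_logic;
        z : out std_logic
      );
    end simple;

    2.5.2.1 Implicit Moore State Machine

    architecture moore_implicit of simple is
    begin
      process
      begin
        z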


    architecture moore_explicit_v1 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state


    architecture moore_explicit_v3 of simple is

    type state_ty is (s0, s1, s2, s3);

    signal state, state_nxt : state_ty;

    begin

    process (clk)

    begin

    if rising_edge(clk) then

    state


    Mealy machines have a combinational path from inputs to outputs, which often violates good coding guidelines for hardware. Thus, Moore machines are much more common. You should know how to write a Mealy machine if needed, but most of the state machines that you design will be Moore machines.

    This is the same entity as for the simple Moore state machine. The behaviour of the Mealy machine is the same as the Moore machine, except for the timing relationship between the output (z) and the input (a).

    [State diagram: Mealy machine with states s0, s1, s2, and s3, and transitions labelled a/1, !a/0, /0, and /0]

    entity simple is
      port (
        a, clk : in std_logic;
        z : out std_logic
      );
    end simple;

    Note: An implicit Mealy state machine is nonsensical.

    In an implicit state machine, we do not have a state signal. But, as the example below illustrates, to create a Mealy state machine we must have a state signal.

    An implicit style is a nonsensical choice for Mealy state machines. Because the output is dependent upon the input in the current clock cycle, the output cannot be a flop. For the output to be combinational and dependent upon both the current state and the current input, we must create a state signal that we can read in the assignment to the output. Creating a state signal obviates the advantages of using an implicit style of state machine.
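The point about the state signal can be seen in a small explicit Mealy sketch for the same `simple` entity. The architecture name `mealy_sketch` and the transitions are assumptions made for illustration (s0 to s1 with output 1 when a is '1', s0 to s2 with output 0 otherwise, and output 0 on all other transitions); this is not the notes' own listing.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity simple is
  port (
    a, clk : in  std_logic;
    z      : out std_logic
  );
end simple;

-- Sketch only: the transition structure below is assumed.
architecture mealy_sketch of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty := s0;
begin
  -- State register and next-state logic in a single clocked process.
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if a = '1' then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 => state <= s3;
        when s3      => state <= s0;
      end case;
    end if;
  end process;

  -- Mealy output: combinational, reads both the state signal and the
  -- current input, so z can change in the same cycle as a.
  z <= '1' when (state = s0 and a = '1') else '0';
end mealy_sketch;
```

The output assignment has to read `state`, which is exactly the signal that an implicit style tries to avoid declaring.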

    architecture implicit_mealy of simple is

    type state_ty is (s0, s1, s2, s3);

    signal state : state_ty;

    begin

    process

    begin

    state


    architecture mealy_explicit of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state


    All circuits should have a reset signal that puts the circuit back into a good initial state. However, not all flip flops within the circuit need to be reset. In a circuit that has a datapath and a state machine, the state machine will probably need to be reset, but the datapath may not need to be reset.

    There are standard ways to add a reset signal to both explicit and implicit state machines.

    It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.
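For an explicit state machine, testing reset on every clock cycle usually means putting the reset test outside the case statement. The sketch below assumes a hypothetical entity `simple_rst` (the `simple` entity with a synchronous, active-high `reset` input added) and the same assumed transitions as a four-state Moore machine; none of these names come from the notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical variant of the "simple" entity with a reset input.
entity simple_rst is
  port (
    a, clk, reset : in  std_logic;
    z             : out std_logic
  );
end simple_rst;

architecture moore_explicit_reset of simple_rst is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty := s0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        -- Checked on every rising edge, whatever the current state is.
        state <= s0;
      else
        case state is
          when s0 =>
            if a = '1' then
              state <= s1;
            else
              state <= s2;
            end if;
          when s1 | s2 => state <= s3;
          when s3      => state <= s0;
        end case;
      end if;
    end if;
  end process;

  z <= '1' when state = s1 else '0';
end moore_explicit_reset;
```

Because the reset test wraps the whole case statement, no state can miss a reset, and the machine returns to s0 one clock edge after reset is asserted.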

    Reset with Implicit State Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    With an implicit state machine, we need to insert a loop in the process and test for reset after each

    wait statement.

    Here is the implicit Moore machine from section 2.5.2.1 with reset code added in bold.

    architecture moore_implicit of simple is

    begin

    process

    begin

    init : loop -- outermost loop

    z


    Tradeoffs in Encoding Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    - Gray is good for low-power applications where consecutive data values typically differ by 1 (e.g. no random jumps).

    - One-hot usually has less combinational logic and runs faster than binary for machines with up to a dozen or so states. With more than a dozen states, the extra flip-flops required by one-hot encoding become too expensive.

    - Custom is great if you have lots of time and are incredibly intelligent, or have deep insight into the guts of your design.

    Note: Don't-care values. When we don't care what the value of a signal is, we assign the signal '-', which is "don't care" in VHDL. This should allow the synthesis tool to use whatever value is most helpful in simplifying the Boolean equations for the signal (e.g. Karnaugh maps). In the past, some groups in E&CE 427 have used '-' quite successfully to decrease the area of their design. However, a few groups found that using '-' increased the size of their design, when they were expecting it to decrease the size. So, if you are tweaking your design to squeeze out the last few unneeded FPGA cells, pay close attention as to whether using '-' hurts or helps.
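A don't-care assignment typically appears in the branch of a case statement that covers values the designer knows cannot occur. The sketch below uses a hypothetical entity `dc_demo` and assumes that the select value "11" never occurs in the surrounding design; both are invented for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity (not from the notes).
entity dc_demo is
  port (
    sel     : in  std_logic_vector(1 downto 0);
    a, b, c : in  std_logic;
    y       : out std_logic
  );
end dc_demo;

architecture main of dc_demo is
begin
  process (sel, a, b, c)
  begin
    case sel is
      when "00"   => y <= a;
      when "01"   => y <= b;
      when "10"   => y <= c;
      -- Assumption: sel = "11" never occurs, so the synthesis tool may
      -- pick whichever value of y gives the simplest logic.
      when others => y <= '-';
    end case;
  end process;
end main;
```

In simulation, `y` shows '-' if the unexpected select value does occur, which also helps to catch a wrong assumption. Whether the synthesized area actually shrinks depends on the tool, as the note above warns.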

    2.6 Dataflow Diagrams

    2.6.1 Dataflow Diagrams Overview

    Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.

    Purpose:

    - Provide a disciplined approach for designing datapath-centric circuits
    - Guide the design from algorithm, through high-level models, and finally to register transfer level code for the datapath and control circuitry.
    - Estimate area and performance
    - Make tradeoffs between different design options

    Background:

    - Based on techniques from high-level synthesis tools
    - Some similarity between high-level synthesis and software compilation
    - Each dataflow diagram corresponds to a basic block in software compiler terminology.

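    [Figure: chain of five adders with inputs a, b, c, d, e, f, intermediate signals x1-x4, and output z]

    Data-dependency graph for z = a + b + c + d + e + f

    [Figure: the same chain of five adders, with inputs a, b, c, d, e, f, intermediate signals x1-x4, and output z, divided into clock cycles]

    Dataflow diagram for z = a + b + c + d + e + f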

    [Figure: dataflow diagram for z = a + b + c + d + e + f with inputs a, b, c, d, e, f, five adders, intermediate signals x1-x4, and output z]

    Horizontal lines mark clock cycle boundaries.

    2.6.2 Dataflow Diagrams, Hardware, and Behaviour

    The use of memory arrays in dataflow diagrams is described in section 2.7.4.

    Primary Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    [Figure: dataflow diagram, hardware, and behaviour (waveforms for clk, i, and x)]

    Register Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    [Figure: dataflow diagram, hardware, and behaviour (waveforms for clk, i, and x)]

    Register Signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    [Figure: dataflow diagram, hardware, and behaviour (waveforms for clk, i1, i2, and x), with x produced by an adder from i1 and i2]

    Combinational-Component Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    [Figure: dataflow diagram, hardware, and behaviour (waveforms for clk, i1, i2, and x), with x produced by an adder from i1 and i2]
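The difference between the Register Signal and Combinational-Component Output cases can be sketched in VHDL. The entity `df_demo`, the port names `x_reg` and `x_comb`, and the 8-bit width are hypothetical choices for this illustration; the mapping from diagram to code is my reading of the section, since the figures did not survive in this copy.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity (not from the notes).
entity df_demo is
  port (
    clk    : in  std_logic;
    i1, i2 : in  unsigned(7 downto 0);
    x_reg  : out unsigned(7 downto 0);
    x_comb : out unsigned(7 downto 0)
  );
end df_demo;

architecture main of df_demo is
begin
  -- Register signal: the adder output crosses a clock-cycle boundary,
  -- so the sum becomes visible after the next rising edge of clk.
  process (clk)
  begin
    if rising_edge(clk) then
      x_reg <= i1 + i2;
    end if;
  end process;

  -- Combinational-component output: the sum is used within the same
  -- clock cycle, so there is no flip-flop on the adder output.
  x_comb <= i1 + i2;
end main;
```

In a waveform, `x_comb` follows `i1 + i2` within the same cycle, while `x_reg` shows the same value one clock cycle later.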

    2.6.3 Area Estimation

    - Maximum number of blocks in a clock cycle is the total number of that component that are needed
    - Maximum number of signals that cross a cycle boundary is the total number of registers that are needed
    - Maximum number of unconnected signal tails in a clock cycle is the total number of inputs that are needed
    - Maximum number of unconnected signal heads in a clock cycle is the total number of outputs that are needed

    The information above is only for estimating the number of components that are needed. In fact, these estimates give lower bounds. There might be constraints on your design that will force you to use more components (e.g., you might need to read all of your inputs at the same time).

    Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath components, might force you to make tradeoffs that increase the number of datapath components to decrease the overall area of the circuit.

    Of particular relevance to FPGAs:

    - With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
    - With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell per bit.
    - In FPGAs, registers are usually "free", in that the area consumed by a circuit is limited by the amount of combinational logic, not the number of flip-flops.

    In comparison, with ASICs and custom VLSI, 2:1 multiplexers are much smaller than adders, and registers are quite expensive in area.

    2.6.4 Dataflow Diagram Execution

    Execution with Registers on Both Inputs and Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    [Figure: dataflow diagram for z = a + b + c + d + e + f (intermediate signals x1-x5) with clock cycles 0-6, and a timing diagram of clk, a, x1-x5, and z over cycles 0-6]

    Execution Without Output Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    [Figure: dataflow diagram for z = a + b + c + d + e + f (intermediate signals x1-x5) with clock cycles 0-5, and a timing diagram of clk, a, x1-x5, and z over cycles 0-6]

    2.6.5 Performance Estimation

    Performance Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    $\text{Performance} \propto \dfrac{1}{\text{TimeExec}}$

    $\text{TimeExec} = \text{Latency} \times \text{ClockPeriod}$

    Latency = number of clock cycles from inputs to outputs

    There is much more information on performance in chapter 4, which is devoted to performance.

    Performance of Dataflow Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    - Latency: count horizontal lines in diagram
    - Min clock period (max clock speed) limited by longest path in a clock cycle

    2.6.6 Design Analysis

    a b c d e f

    +

    +

    +

    +

    +

    x1

    x2

    x3

    x4

    z

    num inputs 6

    num outputs 1

    num registers 6

    num adders 1

    min clock period delay through flop and one adder

    latency 6 clock cycles

    p y p y

    a b c d e f

    +

    +

    +

    +

    +

    x1

    x2

    x3

    x4

    z

    0

    1

    2

    3

    4

    5

    6x5

    a b c d e f

    +

    +

    +

    +

    +

    x1

    x2

    x3

    x4

    z

    0

    1

    2

    3

    4

    x5

    Note: In the Two-add design, half of the last clock cycle is wasted.

    Two Adds per Clock Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

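    [Figure: dataflow diagram with inputs a to f, five adders, intermediate signals x1 to x5, and output z, scheduled with two adds per clock cycle over clock cycles 0 to 4; accompanying waveform of clk, a, x1 to x5, and z over clock cycles 0 to 6.]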

    Design Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    [Figure: the two dataflow diagrams side by side. One add per clock cycle: clock cycles 0 to 6. Two adds per clock cycle: clock cycles 0 to 4.]

                    One add per clock cycle    Two adds per clock cycle
    inputs          6                          6
    outputs         1                          1
    registers       6                          6
    adders          1                          2
    clock period    flop + 1 add               flop + 2 adds
    latency         6                          4

    Question: Under what circumstances would each design option be fastest?

    Answer:

    time = latency * clock period

    compare execution times for both options

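    Let $T_f$ be the delay through a flop and $T_a$ the delay through an adder.

    $T_1 = 6\,(T_f + T_a)$
    $T_2 = 4\,(T_f + 2T_a)$

    One-add will be faster when $T_1 < T_2$:

    $6\,(T_f + T_a) < 4\,(T_f + 2T_a)$
    $6T_f + 6T_a < 4T_f + 8T_a$
    $2T_f < 2T_a$
    $T_f < T_a$

    Sanity check: If add is slower than flop, then we want to minimize the number of adds. One-add spends fewer adder delays in total (6 versus 8), so one-add will be faster when add is slower than flop.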
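As a quick numeric check of the inequality above, the two execution times can be evaluated for a pair of assumed delays. The values of 1 ns and 3 ns below are illustrative assumptions only, not figures from the notes.

```latex
% Case 1 (assumed delays): T_f = 1 ns, T_a = 3 ns, so T_f < T_a
\begin{align*}
T_1 &= 6\,(T_f + T_a)  = 6\,(1 + 3)\ \text{ns} = 24\ \text{ns} \\
T_2 &= 4\,(T_f + 2T_a) = 4\,(1 + 6)\ \text{ns} = 28\ \text{ns}
\end{align*}
% One-add is faster, as predicted by T_f < T_a.

% Case 2 (assumed delays): T_f = 3 ns, T_a = 1 ns, so T_f > T_a
\begin{align*}
T_1 &= 6\,(3 + 1)\ \text{ns} = 24\ \text{ns} \\
T_2 &= 4\,(3 + 2)\ \text{ns} = 20\ \text{ns}
\end{align*}
% Two-add is faster. At T_f = T_a both designs take 12\,T_a.
```

In both cases the faster design is the one the condition $T_f < T_a$ selects, and the break-even point is exactly $T_f = T_a$.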

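    2.7 Memory Arrays and RTL Design

    2.7.1 Memory Operations

    Read of Memory with Registered Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    [Figure, three panels. Dataflow diagram: mem(rd) takes memory M and address a and produces data d. Hardware: memory block M with ports WE, A, DI, and DO, connected to clk, we, and a, with output do. Behaviour: waveform of clk, we, a, and do, in which do carries d = M(a).]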

    Write to Memory with Registered Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

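    [Figure, three panels. Dataflow diagram: mem(wr) takes memory M, address a, and data di and produces the updated memory M. Hardware: memory block M with ports WE, A, DI, and DO, connected to clk, we, a, and di, with output do. Behaviour: waveform of clk, we, a, di, and do, in which M(a) takes the value d and do is shown as U.]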

    Dual-Port Memory with Registered Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

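    [Figure, three panels. Dataflow diagram: mem(wr) with a0 and di0, and mem(rd) with a1 producing do1, both operating on memory M. Hardware: memory block M with ports WE, A0, DI0, DO0, A1, and DO1, connected to clk, we, a0, di0, and a1, with outputs do0 and do1. Behaviour: waveform of clk, we, a0, di0, a1, do0, and do1.]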

    Sequence of Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    [Figure: dataflow diagram of a write (mem(wr) with a0 and di0) followed by several reads (mem(rd) producing do1 and do0) on memory M, with the matching waveform of clk, we, a0, di0, a1, do0, and do1.]

    2.7.2 Memory Arrays in VHDL

    2.7.2.1 Using a Two-Dimensional Array for Memory

    A memory array can be written in VHDL as a two-dimensional array:

        subtype data is std_logic_vector(7 downto 0);
        type data_vector is array( natural range <> ) of data;
        signal mem : data_vector(31 downto 0);

    These two-dimensional arrays can be useful in high-level models and in specifications. However, it is possible to write code using a two-dimensional array that cannot be synthesized. Also, some synthesis tools (including Synopsys Design Compiler and FPGA Compiler) will synthesize two-dimensional arrays very inefficiently.

    The example below illustrates: lack of interface protocol, combinational write, multiple write ports, multiple read ports.

        architecture main of mem_not_hw is
          subtype data is std_logic_vector(7 downto 0);
          type data_vector is array( natural range <> ) of data;
          signal mem : data_vector(31 downto 0);
        begin

        subtype data is std_logic_vector(7 downto 0);
        type data_vector is array( natural range <> ) of data;
        end;

        entity mem is
          port (
            clk : in  std_logic;
            we  : in  std_logic;             -- write enable
            a   : in  unsigned(4 downto 0);  -- address
            di  : in  data;                  -- data_in
            do  : out data                   -- data_out
          );
        end mem;

        architecture main of mem is
          signal mem : data_vector(31 downto 0);
        begin

    do


    needs, you can construct your own component from smaller ones.

    [Figure: two N×W memories (ports WE, A, DI, DO) sharing WriteEn, Addr, and Clk. One takes DataIn[W-1..0] and drives DataOut[W-1..0]; the other takes DataIn[2W-1..W] and drives DataOut[2W-1..W].]

    Figure 2.4: An N×2W memory from N×W components

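    [Figure: two N×W memories (ports WE, A, DI, DO) with signals WriteEn, Addr[logN-1..0], DataIn, Clk, and DataOut; Addr[logN] drives the select of a 2:1 multiplexer with inputs labeled 1 and 0.]

    Figure 2.5: A 2N×W memory from N×W components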

        library ieee;
        use ieee.std_logic_1164.all;
        use ieee.numeric_std.all;

        entity ram16x4s is
          port (
            clk, we  : in  std_logic;
            data_in  : in  std_logic_vector(3 downto 0);
            addr     : in  unsigned(3 downto 0);
            data_out : out std_logic_vector(3 downto 0)
          );
        end ram16x4s;

        architecture main of ram16x4s is
          component ram16x1s
            port (
              d              : in  std_logic;  -- data in
              a3, a2, a1, a0 : in  std_logic;  -- address
              we             : in  std_logic;  -- write enable
              wclk           : in  std_logic;  -- write clock
              o              : out std_logic   -- data out
            );
          end component;
        begin
          -- One 16x1 memory per data bit; all four share we, clk, and addr.
          mem_gen :
          for i in 0 to 3 generate
            ram : ram16x1s
              port map (
                we   => we,
                wclk => clk,
                ----------------------------------------------
                -- d and o are dependent on i
                a3 => addr(3), a2 => addr(2),
                a1 => addr(1), a0 => addr(0),
                d  => data_in(i),
                o  => data_out(i)
                ----------------------------------------------
              );
          end generate;
        end main;
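The listing above widens the data word, as in Figure 2.4. The depth expansion of Figure 2.5 could be sketched in the same style. This is only an illustration under stated assumptions: the entity name `ram32x4s` and the signals `we_lo`, `we_hi`, `out_lo`, and `out_hi` are hypothetical, the gating of the write enable by the top address bit is an assumption that the figure residue does not confirm, and the output multiplexer assumes that the component reads combinationally. With a registered read, the select bit would need to be delayed by one clock cycle.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 32x4 memory built from two of the 16x4 memories above.
entity ram32x4s is
  port (
    clk, we  : in  std_logic;
    data_in  : in  std_logic_vector(3 downto 0);
    addr     : in  unsigned(4 downto 0);
    data_out : out std_logic_vector(3 downto 0)
  );
end ram32x4s;

architecture main of ram32x4s is
  component ram16x4s
    port (
      clk, we  : in  std_logic;
      data_in  : in  std_logic_vector(3 downto 0);
      addr     : in  unsigned(3 downto 0);
      data_out : out std_logic_vector(3 downto 0)
    );
  end component;
  signal we_lo, we_hi   : std_logic;
  signal out_lo, out_hi : std_logic_vector(3 downto 0);
begin
  -- Assumption: only the bank selected by the top address bit is written.
  we_lo <= we and not addr(4);
  we_hi <= we and addr(4);

  bank_lo : ram16x4s
    port map (clk => clk, we => we_lo, data_in => data_in,
              addr => addr(3 downto 0), data_out => out_lo);

  bank_hi : ram16x4s
    port map (clk => clk, we => we_hi, data_in => data_in,
              addr => addr(3 downto 0), data_out => out_hi);

  -- 2:1 multiplexer on the outputs, selected by the top address bit.
  data_out <= out_hi when addr(4) = '1' else out_lo;
end main;
```

Both banks see the same low address bits and the same input data, so the only extra hardware is the write-enable gating and the output multiplexer.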

    2.7.2.6 Dual-Ported Memory

    Dual-ported memory is similar to single-ported memory, except that it allows two simultaneous reads, or a simultaneous read and write.

    When doing a simultaneous read and write to the same address, the read will usually not see the

    data currently being written.

    Question: Why do dual-ported memories usually not support writes on both ports?

    Answer:

    What should your memory do if you write different values to the same

    address in the same clock cycle?
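A behavioural sketch of a memory with one write port and one read port may help make the read-and-write case concrete. It is a sketch under assumptions, not the course's reference model: the entity name `dp_mem_sketch` is hypothetical, the port names borrow from the figures above, and whether a synthesis tool maps this style onto a dedicated dual-port RAM depends on the tool and the target device.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 32x8 memory: port 0 writes, port 1 reads.
entity dp_mem_sketch is
  port (
    clk : in  std_logic;
    we  : in  std_logic;                     -- write enable for port 0
    a0  : in  unsigned(4 downto 0);          -- port 0 (write) address
    di0 : in  std_logic_vector(7 downto 0);  -- port 0 data in
    a1  : in  unsigned(4 downto 0);          -- port 1 (read) address
    do1 : out std_logic_vector(7 downto 0)   -- port 1 data out
  );
end dp_mem_sketch;

architecture main of dp_mem_sketch is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array (natural range <>) of data;
  signal mem : data_vector(31 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        mem(to_integer(a0)) <= di0;
      end if;
      -- Signal semantics: this read uses the value mem held before the
      -- clock edge, so a simultaneous write to the same address is not
      -- visible on do1 until a later read.
      do1 <= mem(to_integer(a1));
    end if;
  end process;
end main;
```

With a single write port there is never a conflict over which value to store, which is the difficulty the question above points at.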

    2.7.3 Data Dependencies

    Definition of Three Types of Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    There are three types of data dependencies. The names come from pipeline terminology in computer architecture.

    Read after Write   (True dependency):   M[i] := ...   followed by   ... := M[i]
    Write after Write  (Load dependency):   M[i] := ...   followed by   M[i] := ...
    Write after Read   (Anti dependency):   ... := M[i]   followed by   M[i] := ...

    Instructions in a program can be reordered, so long as the data dependencies are preserved.

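    Purpose of Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    W0:   R3 := ...
    W1:   R3 := ...            (producer)
    R1:   ... := ... R3 ...    (consumer)
    W2:   R3 := ...

    WAW ordering prevents W0 from happening after W1.
    RAW ordering prevents R1 from happening before W1.
    WAR ordering prevents W2 from happening before R1.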

    Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose

    in ensuring that producer-consumer relationships are preserved.

    Ordering of Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    Initial memory contents: M[3] = 30, M[2] = 20, M[1] = 10, M[0] = 0

    Initial Program with Dependencies

        M[2] := 21
        M[3] := 31
        A    := M[2]
        B    := M[0]
        M[3] := 32
        M[0] := 01
        C    := M[3]

    Valid Modification

    [Figure: the same seven statements in a reordered sequence, with dependency arrows.]

    Valid (or Bad?) Modification

    [Figure: a second reordered sequence of the same seven statements.]

    Answer:

    Bad modification: M[3] := 32 must happen before C := M[3].

    2.7.4 Memory Arrays and Dataflow Diagrams

    Legend for Dataflow Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    [Figure: legend of dataflow-diagram symbols, labeled Input port (name), Output port (name), State signal (name), Array read (name(rd)), and Array write (name(wr)).]

    Basic Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    Memory Read:    data := mem[addr];

    [Figure: mem(rd) consumes mem and addr and produces data, plus an anti-dependency arrow that produces mem.]

    Memory Write:   mem[addr] := data;

    [Figure: mem(wr) consumes mem, data, and addr and produces mem.]

    Dataflow diagrams show the dependencies between operations. The basic memory operations are similar, in that each arrow represents a data dependency.

    There are a few aspects of the basic memory operations that are potentially surprising:

    The anti-dependency arrow producing mem on a read.

    Reads and writes are dependent upon the entire previous value of the memory array.

    The write operation appears to produce an entire memory array, rather than just updating an individual element of an existing array.

    Normally, we think of a memory array as stationary. To do a read, an address is given to the array and the corresponding data is produced. In dataflow diagrams, it may be somewhat surprising to see the read and write operations consuming and producing memory arrays.

    Our goal is to support memory operations in dataflow diagrams. We want to model memory operations similarly to datapath operations. When we do a read, the data that is produced is dependent upon the contents of the memory array and the address. For write operations, the apparent dependency on, and production of, an entire memory array is because we do not know which address in the array will be read from or written to. The anti-dependency for memory reads is related to Write-after-Read dependencies, as discussed in Section 2.7.3. There are optimizations that can be performed when we know the address (Section 2.7.4).

    Dataflow Diagrams and Data Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    Read after Write

    Algo: mem[wr_addr] := data_in;
          data_out := mem[rd_addr];

    [Figure: dataflow diagram with mem(wr) (mem, data_in, wr_addr) and mem(rd) (rd_addr, producing data_out).]

    Optimization when rd_addr ≠ wr_addr

    [Figure: dataflow diagram for the same code, with the same mem(wr) and mem(rd) operations rearranged.]

    Write after Write

    Algo: mem[wr1_addr] := data1;
          mem[wr2_addr] := data2;

    [Figure: dataflow diagram with two mem(wr) operations (data1, wr1_addr and data2, wr2_addr).]

    Scheduling option when wr1_addr ≠ wr2_addr

    [Figure: dataflow diagram for the same code, with the same two mem(wr) operations rearranged.]

    Write after Read

    Algo: rd_data := mem[rd_addr];
          mem[wr_addr] := wr_data;

    [Figure: dataflow diagram with mem(rd) (rd_addr, producing rd_data) and mem(wr) (wr_addr, wr_data).]

    Optimization when rd_addr ≠ wr_addr

    [Figure: dataflow diagram for the same code, with the same mem(rd) and mem(wr) operations rearranged.]

    2.7.5 Example: Memory Array and Dataflow Diagram

    [Figure: example code with statements numbered 1 to 7, and its initial dataflow diagram: a chain of M(wr) and M(rd) operations on array M, labeled with the same numbers.]

        1:  M[2] := 21
        2:  M[3] := 31
        3:  A    := M[2]
        4:  B    := M[0]
        5:  M[3] := 32
        6:  M[0] := 01
        7:  C    := M[3]

    Figure 2.6: Memory array example code and initial dataflow diagram

    The dependency and anti-dependency arrows in the dataflow diagram in Figure 2.6 are based solely upon whether an operation is a read or a write. The arrows do not take into account the address that is read from or written to.

    In Figure 2.7, we have used knowledge about which addresses we are accessing to remove unneeded dependencies. These are the real dependencies and match those shown in the code fragment for Figure 2.6. In Figure 2.8 we have placed an ordering on the read operations and an ordering on the write operations. The ordering is derived by obeying data dependencies and then rearranging the operations to perform as many operations in parallel as possible.

    [Figure: dataflow diagram of the seven M(wr) and M(rd) operations from Figure 2.6, producing A, B, and C.]

    Figure 2.7: Memory array with minimal dependencies

    [Figure: the same dataflow diagram with numbered orderings on the read operations and on the write operations.]

    Figure 2.8: Memory array with orderings

    [Figure: the same dataflow diagram with the operations assigned to numbered clock cycles.]

    Figure 2.9: Final version of Figure 2.6

    Put as many parallel operations into the same clock cycle as allowed by resources (for a dual-port RAM: one write plus one read, two reads, or one write). Preserve dependencies by putting dependent operations in separate clock cycles.

    2.8 Input / Output Protocols

    An important aspect of hardware design is choosing an input/output protocol that is easy to implement and suits both your circuit and your environment. Here are a few simple and common protocols.

    [Figure: waveform of rdy, data, and ack.]

    Figure 2.10: Four phase handshaking protocol

    Used when timing of communication between producer and consumer is unpredictable. The disadvantage is that it is cumbersome to implement and slow to execute.

    [Figure: waveform of clk, data, and valid.]

    Figure 2.11: Valid-bit protocol

    A low overhead (both in area and performance) protocol. Consumer must always be able to accept

    incoming data. Often used in pipelined circuits. More complicated versions of the protocol can

    handle pipeline stalls.
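A single pipeline stage using the valid-bit protocol might look like the sketch below. All names (`valid_stage`, `i_valid`, `i_data`, `o_valid`, `o_data`) are hypothetical, the increment stands in for whatever computation the stage performs, and reset logic is omitted, so `o_valid` is undefined until the first clock edge.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical one-stage pipeline with a valid bit travelling beside the data.
entity valid_stage is
  port (
    clk     : in  std_logic;
    i_valid : in  std_logic;             -- high when i_data is meaningful
    i_data  : in  unsigned(7 downto 0);
    o_valid : out std_logic;             -- high when o_data is meaningful
    o_data  : out unsigned(7 downto 0)
  );
end valid_stage;

architecture main of valid_stage is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- The valid bit is delayed by exactly as many cycles as the data.
      o_valid <= i_valid;
      if i_valid = '1' then
        o_data <= i_data + 1;            -- placeholder computation
      end if;
    end if;
  end process;
end main;
```

The stage has no way to refuse input, which is why the consumer must always be able to accept data in this protocol.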

    [Figure: waveform of clk, data_in, start, done, and data_out.]

    Figure 2.12: Start/Done protocol

    A low overhead (both in area and performance) protocol. Useful when a circuit works on one piece

    of data at a time and the time to compute the result is unpredictable.
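The start/done protocol can be sketched with a circuit whose run time depends on its input. The example below is an illustration only: the entity name `start_done_sketch`, the internal signals, and the computation (adding 3 once per cycle, `data_in` times) are all hypothetical, and reset is replaced by an initial value on `busy`.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical circuit: accepts one input on start, pulses done with the result.
entity start_done_sketch is
  port (
    clk      : in  std_logic;
    start    : in  std_logic;             -- one-cycle pulse: data_in is valid
    data_in  : in  unsigned(7 downto 0);
    done     : out std_logic;             -- one-cycle pulse: data_out is valid
    data_out : out unsigned(7 downto 0)
  );
end start_done_sketch;

architecture main of start_done_sketch is
  signal busy  : std_logic := '0';
  signal count : unsigned(7 downto 0);
  signal acc   : unsigned(7 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      done <= '0';                        -- default: done is low
      if busy = '0' then
        if start = '1' then               -- accept a new piece of data
          count <= data_in;
          acc   <= (others => '0');
          busy  <= '1';
        end if;
      elsif count = 0 then                -- finished: publish the result
        data_out <= acc;
        done     <= '1';
        busy     <= '0';
      else                                -- one step of the computation
        acc   <= acc + 3;
        count <= count - 1;
      end if;
    end if;
  end process;
end main;
```

The number of cycles between start and done varies with `data_in`, and the environment only has to wait for the done pulse rather than count cycles.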

    2.9 Design Example: Massey

    We'll go through the following artifacts:

    1. requirements
    2. algorithm
    3. dataflow diagram
    4. high-level models
    5. hardware block diagram
    6. RTL code for datapath
    7. state machine
    8. RTL code for control

    Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    1. Scheduling (allocate operations to clock cycles)
    2. I/O allocation
    3. First high-level model
    4. Register allocation
    5. Datapath allocation
    6. Connect datapath components, insert muxes where needed
    7. Design implicit state machine
    8. Optimize
    9. Design explicit-current state machine
    10. Optimize

    2.9.1 Requirements

    Functional requirements:

    Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f

    Use registers on both inputs and outputs

    Performance requirements:

    Maximum clock period: unlimited

    Maximum latency: four clock cycles

    Cost requirements:

    Maximum of two adders

    Small miscellaneous hardware (e.g. muxes) is unlimited

    Maximum of three inputs and one output

    Design effort is unlimited

    Note: In reality multiplexers are not free. In FPGAs, a 2:1 mux is more expensive than a full-adder. A 2:1 mux has three inputs while an adder has only two inputs (the carry-in and carry-out signals usually use the special vertical connections on the FPGA cell). In FPGAs, sharing an adder between two signals can be more expensive than having two adders. In a generic-gate technology, a multiplexer contains three two-input gates, while a full-adder contains fourteen two-input gates.

    2.9.2 Algorithm

    We'll use parentheses to group operations so as to maximize our opportunities to perform the work in parallel:

    $z = (a + b) + (c + d) + (e + f)$

    This results in the following data-dependency graph:

    [Figure: data-dependency graph with inputs a, b, c, d, e, f and five adders.]

    2.9.3 Initial Dataflow Diagram

    [Figure: dataflow diagram with inputs a, b, c, d, e, f, five adders, and output z.]

    This dataflow diagram violates the requirement to use at most three inputs.

    Scheduling to Optimize Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    Original parallel    Parallel after scheduling

    a b c d e f a b c d

    2.9.4 Dataflow Diagram Scheduling

    We can potentially optimize the inputs, outputs, area, and performance of a dataflow diagram by

    rescheduling the operations, that is, allocating the operations to different clock cycles.

    Parallel algorithms have higher performance and greater scheduling flexibility than serial algo-

    rithms

    Ser