VHDL IMPLEMENTATION OF PIPELINED DLX MICROPROCESSOR

IGNATIUS EDMOND ANTHONY

UNIVERSITI TEKNOLOGI MALAYSIA

For my beloved parents, sisters and friends, and not forgetting

my dearest partner-in-crime, Sheena.

ACKNOWLEDGEMENT

I would like to extend my sincerest gratitude and appreciation to anyone and

everyone who has contributed explicitly or implicitly towards the success of this

project entitled “VHDL Implementation of Pipelined DLX Microprocessor”.

Acknowledgment is particularly given to my project supervisor, Associate Professor

Muhammad Mun’im bin Ahmad Zabidi, who despite his tight schedule always

makes time to oversee the progress of this project besides offering advice on how to

proceed further whenever hindrances are encountered.

Finally, a big thank you to my family, who are always by my side offering me

moral support, and to Sheena, who was relentlessly there as a shoulder to lean on in

my darkest hours.

ABSTRACT

The 32-bit load/store DLX processor architecture is a generic RISC processor
designed by Hennessy and Patterson for pedagogical purposes. The DLX design
abstracts many features of general-purpose commercial processors and is a
well-understood computer architecture, providing a good architectural model for
study, not only because of the popularity of this type of machine, but also because
it is easy to understand. Utilizing open-source hardware such as the DLX core offers
the advantage of free distribution as well as source code that is available and open,
allowing modification at will. This project continues previous work on integration of
the DLX core by adding instruction pipelining, which was excluded from the previous
project’s scope due to complexity and time limitations; instruction execution speedup
and performance were left to be dealt with in future work. Since the DLX
microprocessor is, by nature, a five-stage pipelined microprocessor, the core’s
instruction execution performance can be expected to improve with a pipelined
implementation. A comparison between the non-pipelined and pipelined DLX was also
performed to verify this expected speedup.

ABSTRAK

The DLX processor architecture is a generic 32-bit RISC processor designed by
Hennessy and Patterson for research and educational purposes. The DLX architecture
encompasses many of the features and functions of general-purpose commercial
processors, and is not only an easily understood computer architecture but also a
very popular one. Using an open-source (open-core) processor such as the DLX offers
the advantage of openly available source code, which permits and simplifies
modification to suit project requirements. The aim of this project is to continue a
previous project, in which the DLX processor and the Wishbone interface were
integrated, by adding instruction pipelining. With this addition, instruction
execution time is expected to be shortened, given that instruction processing in the
DLX is carried out in five stages. A performance comparison of the DLX processor
before and after the pipelining implementation was also carried out in this project
to verify this initial expectation.

TABLE OF CONTENTS

CHAPTER TITLE

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
LIST OF APPENDICES

1 PROJECT OVERVIEW
    1.1 Background
    1.2 Objectives
    1.3 Scope of Work
    1.4 Expected Results
    1.5 Report Layout

2 LITERATURE REVIEW
    2.1 Processor Selection Considerations for Systems-On-Chip
    2.2 DLX Implementations
        2.2.1 ASPIDA DLX Project
        2.2.2 University of Stuttgart DLX Project
    2.3 Pipelining Design Considerations

3 DLX PROCESSOR ARCHITECTURE
    3.1 Overview
    3.2 External View of the DLX
        3.2.1 The DLX Interface
        3.2.2 Memory Interface and Access
        3.2.3 Reset
        3.2.4 Halt
        3.2.5 Error
    3.3 The DLX Programming Model
        3.3.1 Accessible Registers
        3.3.2 The DLX Instruction Format
        3.3.3 The DLX Instruction Set
    3.4 Internal Structure of the DLX
        3.4.1 The Datapath
        3.4.2 The Control Unit
        3.4.3 The Basic Execution Steps

4 DESIGN WORKFLOW, METHODOLOGY AND TOOLS
    4.1 Design Workflow
    4.2 Tools
        4.2.1 Altera Quartus II 6.0 Web Edition
        4.2.2 VHDL

5 PIPELINED DLX COMPONENT DESIGN
    5.1 Pipelined DLX Overview
    5.2 Pipelined Datapath
        5.2.1 Load / Store Instruction
        5.2.2 Arithmetic/Logic Instruction
        5.2.3 Test and Set Instructions
        5.2.4 Branch/Jump Instructions
    5.3 Redesigned Control Unit
        5.3.1 Instruction Fetch (IF)
        5.3.2 Instruction Decode (ID)
        5.3.3 Execution Stage (EXE)
        5.3.4 Memory Stage (MEM)
        5.3.5 Write Back (WB)

6 RESULTS AND PERFORMANCE ANALYSIS
    6.1 Overview
    6.2 Functional Validation
    6.3 Gate Count and Frequency Statistics
    6.4 Instruction Execution Speed-Up

7 CONCLUSION AND FUTURE WORK RECOMMENDATIONS
    7.1 Conclusion
    7.2 Recommendations for Future Work

REFERENCES

Appendix A

LIST OF TABLES

TABLE NO. TITLE

2.1 Open Cores Architecture Comparison
3.1 DLX arithmetic and logic instructions
3.2 DLX test and set instructions
3.3 DLX branch instructions
3.4 DLX special instructions
3.5 DLX load-store instructions
3.6 ALU Operations
6.1 Non-pipelined versus Pipelined DLX

LIST OF FIGURES

FIGURE NO. TITLE

2.1 DLX Programming Model
2.2 ASPIDA DLX Instruction Layout
2.3 Supported Integer Instructions
2.4 De-synchronized DLX datapath
2.5 Classic 5-stage instruction pipeline
3.1 External interface of the DLX
3.2 Byte positions in a DLX word (big-endian)
3.3 Memory Operation Modes
3.4 Memory Read Access
3.5 Memory Write Access
3.6 DLX instruction formats
3.7 Internal structure of the DLX
3.8 DLX datapath (non-pipelined)
3.9 Structure of the DLX Control Unit
4.1 Design Methodology Workflow
4.2 Quartus II Design Flow
5.1 DLX Internal Structural View
5.2 DLX Pipelined Datapath
5.3 Control Unit for Pipelined DLX
5.4 Instruction Fetch (IF) Block Diagram
5.5 Instruction Decode (ID) Block Diagram
5.6 Execute (EXE) Block Diagram
5.7 Memory (MEM) Block Diagram
5.8 Write-Back (WB) Block Diagram
6.1 Simulation Waveform
6.2 Gate Count Snapshot
6.3 Redesigned DLX Fmax
6.4 Non-pipelined DLX execution waveform
6.5 Pipelined DLX execution waveform

LIST OF ABBREVIATIONS

ALU - Arithmetic-Logic Unit

AMBA - Advanced Microcontroller Bus Architecture

ARM - Advanced RISC Machine

ASIC - Application-Specific Integrated Circuit

CLK - Clock

CPU - Central Processing Unit

CAM - Content Addressable Memory

EXE - Execution stage

FPGA - Field-Programmable Gate Array

FSM - Finite State Machine

GPR - General Purpose Register

IAR - Interrupt Address Register

ICR - Interrupt Control Register

ID - Instruction-Decode

IF - Instruction-Fetch

ISA - Instruction Set Architecture

MAR - Memory Address Register

MDR - Memory Data Register

PC - Program Counter

RISC - Reduced Instruction Set Computing

RW - Read/Write

TBR - Trap Base Register

VHDL - Very-High-Speed-Integrated-Circuit Hardware Description Language

VLSI - Very Large Scale Integration

WB - Write-Back

LIST OF APPENDICES

APPENDIX TITLE

A VHDL Source Code for Structural DLX Core

CHAPTER 1

PROJECT OVERVIEW

1.1 Background

The 32-bit load/store DLX processor architecture is a generic RISC processor

designed by Hennessy & Patterson for pedagogical purposes. The DLX processor

design abstracts many features of general-purpose commercial processors, and is a

well-understood computer architecture.

The DLX provides a good architectural model for study, not only because of

the popularity of this type of machine, but also because it is easy to understand. Like

most load/store machines, the DLX emphasizes a simple load/store instruction set,

design for pipelining efficiency, an easily decoded instruction set and efficiency as a

compiler target.

The Wishbone Bus is an open source hardware computer bus intended to let

the parts of an integrated circuit communicate with each other. The aim is to allow

the connection of differing cores to each other inside of a chip or system-on-chip.

Utilizing open-source hardware such as the DLX core and the Wishbone bus (or its
competitor, the AMBA bus) yields benefits that include solutions to most of the
problems associated with proprietary cores. Besides the apparent advantage of free
distribution, open-source hardware standards and open microprocessor cores offer
the following:

• Each core will have a larger user base, which will ensure better support, better
documentation and better implementation examples to work from.

• The source is available, so any developer can find out what he or she needs to
know about the core.

• Eventually, as cores and standards for them are developed, open cores will become
more standards-compliant than proprietary cores.

• The source code can be modified at will, which enables designers to fine-tune and
tweak any design for any design constraint: gate count, performance, power, etc.

While previous work on integration of the DLX core and Wishbone bus

interface has been undertaken and completed, instruction pipelining was excluded

from the project’s scope due to complexity and time limitations. A DLX

microprocessor core with a non-pipelined instruction execution data path was used to

showcase the microprocessor core-Wishbone bus integration functionality on the

FPGA.

Since the previous project’s focus was on functionality, instruction execution
speedup and performance were left to be dealt with in future work. Since the DLX
microprocessor is, by nature, a five-stage pipelined microprocessor, the core’s
instruction execution performance can be expected to improve with a pipelined
implementation.

1.2 Objectives

The objective of this project is to design a five-stage pipelined DLX

microprocessor for the purpose of instruction execution speedup and performance

improvement.

As part of the project, the instruction execution speedup of the enhanced DLX
processor with pipelined instruction execution over the non-pipelined DLX will be
evaluated and analyzed using a predetermined test suite consisting of several small
programs.

Finally, several design-for-test (DFT) features that could be integrated into the
DLX-Wishbone system to improve system debuggability will be explored and
investigated as future work.

1.3 Scope of Work

This project is focused on the incremental enhancement of an existing DLX

processor design with Wishbone bus interface integration in UTM [2], to modify the

data path unit and controller unit for pipelined instruction execution in five stages:

Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access

(MEM) and Write-Back (WB).

The reference source code for this project is taken from the DLX project by the
University of Stuttgart, Germany, which implements a non-pipelined,
non-synthesizable flavour of the DLX processor.

Implementation of the DLX enhancements would involve coding in hardware

description language VHDL. Altera’s Quartus 6.1 Web Edition is the tool of choice

for design entry, logic synthesis (compilation) and simulation.

In the work flow of the project, the main emphasis would be on adding the

pipelining capability into the DLX processor’s datapath unit as well as any

incremental changes in the control unit to support the pipelined execution. The

Wishbone bus interface integration into the design will be used as-is from the

previous project with the expectation that any previous issues have been resolved.

While the eventual target implementation of the design would be in ASIC, this
project’s implementation is restricted to functional and timing simulation within
Altera’s Quartus software.

As a final outcome, the implementation is validated and verified for functional
correctness through simulation. Performance analysis and evaluation are carried out
using a predetermined suite of small programs executed on the integrated DLX system.

1.4 Expected Results

The pipelined DLX processor is successfully designed and simulated in Altera’s
Quartus software. The processor implementation is validated and verified using an
FFT computation task or simple sorting programs.

The enhanced pipelined DLX will offer at least a 1.5X instruction execution speedup
over the non-pipelined core, measured by the CPU cycle time required to complete a
predefined list of computation tasks on both processors. This will serve as the
baseline expectation to justify that the additional logic overhead incurred by the
pipelined datapath translates to real-world performance speedup.
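
The speedup criterion can be expressed as a short calculation. The following Python
sketch is illustrative only: the function names are hypothetical, and the cycle
counts and clock periods in the usage lines are placeholder assumptions, not
measurements from this project.

```python
def execution_time_ns(cycles: int, clock_period_ns: float) -> float:
    """Execution time of a task: number of clock cycles times the clock period."""
    if cycles <= 0 or clock_period_ns <= 0:
        raise ValueError("cycles and clock period must be positive")
    return cycles * clock_period_ns


def speedup(base_cycles: int, base_period_ns: float,
            pipe_cycles: int, pipe_period_ns: float) -> float:
    """Speedup of the pipelined core over the non-pipelined core on the same task."""
    return (execution_time_ns(base_cycles, base_period_ns)
            / execution_time_ns(pipe_cycles, pipe_period_ns))


def meets_target(measured_speedup: float, target: float = 1.5) -> bool:
    """Compare a measured speedup with the 1.5X baseline expectation."""
    return measured_speedup >= target


# Placeholder values for illustration only (assumed, not measured).
example = speedup(base_cycles=500, base_period_ns=20.0,
                  pipe_cycles=140, pipe_period_ns=25.0)
print(f"speedup = {example:.2f}x, meets target: {meets_target(example)}")
```

Both the cycle count and the clock period enter the comparison, so a pipelined core
that needs fewer cycles but runs at a lower clock frequency is still judged fairly.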

1.5 Report Layout

The layout of this report is as follows:

Chapter 1: Brief overview of project, including objectives and work scope.

Chapter 2: Literature review of other existing DLX projects undertaken by

other universities, design considerations for pipelining, and

microprocessor selection considerations.

Chapter 3: Overview of the DLX processor and instruction set architecture,

pipelining concepts and design considerations.

Chapter 4: Design workflow, methodology and tools.

Chapter 5: Pipelined DLX microprocessor components and design.

Chapter 6: Results and performance analysis.

Chapter 7: Conclusion and future work recommendations.

CHAPTER 2

LITERATURE REVIEW

2.1 Processor Selection Considerations for Systems-On-Chip

In selecting the DLX as the baseline core for this project’s implementation and
processor redesign for pipelining support, several open-source processor designs
were evaluated against the following criteria:

• Processor and instruction set architecture complexity

• Availability of documentation

• Compiler availability

Table 2.1 highlights comparison made between the DLX, Leon and

OpenRISC microprocessors.

The DLX microprocessor is chosen as the CPU for this project, as it was in the
previous project. The main motivations remain: it is free, and the complexity of the
DLX architecture is lower than that of the other two open-source processors
(Table 2.1). The Leon, for example, uses a windowed register architecture, whereas
the DLX uses a plain load-store architecture with a flat register file.

Table 2.1: Open Cores Architecture Comparison

Feature                              DLX                        Leon                     OpenRISC
Windowed registers                   No                         Yes                      No
Number of general purpose registers  32                         40 to 520 (136 typical)  32 GPRs + many system control registers
Most similar to                      MIPS                       SPARC                    None
Complexity                           Low                        High                     High
Reference document                   Hennessy & Patterson text  IEEE 1754                Opencores.org

A windowed register architecture is one in which more than 32 general purpose
registers are implemented, in some designs as many as 256, but only 32 registers are
visible to the programmer at any time. To use the other registers, the programmer
has to ‘slide’ up or down a pointer that selects a predetermined window range.

This architecture is advantageous because a stack is often not needed when calling
subroutines, as the information can be kept in registers. The disadvantage is that
more decoding circuitry is required to implement the windowing function, which makes
the design more complex than one using a non-windowed register file.
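
The sliding-window idea can be sketched in a few lines of Python. This is a
simplified model under stated assumptions, not the Leon/SPARC mechanism: the class
name, the window step of 16 registers and the wrap-around behaviour are all
hypothetical, and the default of 136 physical registers is only borrowed from the
typical figure in Table 2.1.

```python
class WindowedRegisterFile:
    """Simplified windowed register file: many physical registers, 32 visible."""

    def __init__(self, physical: int = 136, window: int = 32, step: int = 16):
        if physical < window:
            raise ValueError("need at least one full window of registers")
        self.regs = [0] * physical   # physical storage
        self.window = window         # number of registers visible at a time
        self.step = step             # how far the window slides (assumed value)
        self.base = 0                # window pointer

    def _index(self, r: int) -> int:
        if not 0 <= r < self.window:
            raise IndexError("only registers 0..window-1 are visible")
        return (self.base + r) % len(self.regs)

    def read(self, r: int) -> int:
        return self.regs[self._index(r)]

    def write(self, r: int, value: int) -> None:
        self.regs[self._index(r)] = value & 0xFFFFFFFF

    def slide(self, direction: int) -> None:
        """Slide the window up (+1) or down (-1) by one step."""
        self.base = (self.base + direction * self.step) % len(self.regs)


rf = WindowedRegisterFile()
rf.write(20, 7)      # caller places an argument in its visible register 20
rf.slide(+1)         # subroutine call: the window moves up by 16 registers
print(rf.read(4))    # the callee sees the same physical register as register 4
```

Because consecutive windows overlap in this model, a value written by the caller is
visible to the callee without a memory access, which is the reason a stack can often
be avoided. The extra index arithmetic in `_index` corresponds to the additional
decoding logic mentioned above.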

The DLX instruction set architecture is akin to the MIPS ISA in many ways, with a
full set of general purpose, programmer-accessible registers as depicted in
Figure 2.1. Both employ a load-store architecture.

Figure 2.1: DLX Programming Model (general purpose registers R0 to R31, and the PC,
ICR, IAR and TBR registers)

The DLX also has several other registers, namely the TBR, ICR and IAR, which are
dealt with in more detail in Chapter 3. One particular point of interest is the
absence of a dedicated status register in the DLX. Instead, every set and test
instruction stores its result flag (1 for true, 0 for false) in a general purpose
destination register. More details are given in Chapter 3, and all set/test
instructions are listed in Table 3.2. The DLX has only two conditional branch
instructions. Coupled with this implementation of the set and test instructions,
they are able to handle a variety of branching conditions, indirectly reducing
pipeline stalls due to branches.

2.2 DLX Implementations

There are several open-source implementations of the DLX available today. The traits
that distinguish these flavours of the DLX include pipelined versus non-pipelined
execution, synthesizable versus non-synthesizable code, and synchronous versus
asynchronous design. In this report, we delve into two such DLX implementations,
namely the ASPIDA DLX project and the University of Stuttgart DLX project.

2.2.1 ASPIDA DLX Project

The ASPIDA open-source DLX supports the full DLX integer ISA. Floating

point operations are not supported in the current version of the processor. The

ASPIDA DLX contains two memory interfaces, following the original DLX model,

which support byte, half-word and word transfers. Branches follow the conventional
RISC semantics and require a branch delay slot, i.e. the instruction following the
branch is always executed. A vectored interrupt co-processor, including an interrupt
cause register and an exception program counter, is included.

This European Union-funded ASPIDA (ASynchronous oPen-source Ip of the

DLX Architecture) project has the goal of promoting the adoption of asynchronous

design, by delivering an open-source asynchronous synthesizable DLX processor

core, supporting the full integer Instruction Set Architecture, interrupts and byte

addressable memory. It will also deliver an asynchronous interconnect fabric based

on the CHAIN architecture, developed by the University of Manchester [3].

The ASPIDA DLX supports the three operation types of the DLX ISA:

• I-type: logic/arithmetic operations performed between a register and an

immediate value. Conditional branches are also I-type instructions.

• R-type: logic/arithmetic operations performed between two registers. Load

and store instructions are also R-type instructions.

• J-type: jump and jump-and-link instructions.

The instruction layout for each instruction type is shown in Figure 2.2.

Figure 2.2: ASPIDA DLX Instruction Layout

The integer subset of the DLX ISA is shown in Figure 2.3, in which the supported
instructions are ticked.

Figure 2.3: Supported Integer Instructions

In this implementation, the DLX is de-synchronized. The global clock is removed and
replaced by handshaking controllers, and the flip-flops are replaced by latch pairs.
Figure 2.4 shows the datapath of the de-synchronized DLX. As can be seen, the
latches that separate the datapath stages are locally clocked by controllers, which
are responsible for producing the appropriate signals so that the data move safely
from one pipeline stage to the next.

Figure 2.4: De-synchronized DLX datapath

2.2.2 University of Stuttgart DLX Project

The University of Stuttgart DLX project is part of the VLSI Design Course by
Gumm [4], completed in December 1995 [5]. The DLX processor design abides by
Hennessy and Patterson’s original DLX RISC machine proposal.

In this project, only a subset of the original instruction set was implemented.

While the original DLX instruction set contained, among others, instructions for

signed and unsigned integer arithmetic and floating point arithmetic, support for

floating point arithmetic was not implemented in this design.

On the other hand, their processor model is extended with some features: interrupt
and exception handling, three different operation modes (supervisor, user and error)
and one additional addressing mode. However, pipelining has not been implemented.
The interrupt and exception handling was derived from the DLXm model of the Alliance
design suite and, to a limited extent, from the SPARC architecture. The memory
addressing was also derived and simplified from the SPARC definition. The timing
models for the bus transactions were derived from the DP32 processor model.

The University of Stuttgart’s processor architecture was called the “DLXS”

where “S” stands for ‘Stuttgart’ version.

The Stuttgart processor design, DLXS, was selected for implementation in the
previous DLX project in UTM. Since DLXS was not synthesizable due to the absence of
a reference clock (processor control was done through control signals from the test
bench), the source code was modified to support a global reference clock.

The DLXS external and internal interfaces are covered more extensively in the
following chapter on the DLX architecture.

2.3 Pipelining Design Considerations

In computing, a pipeline is a set of data processing elements connected in

series, so that the output of one element is the input of the next one. The elements of

a pipeline are often executed in parallel or in time-sliced fashion; in that case, some

amount of buffer storage is often inserted between elements.

An instruction pipeline is a technique used in the design of computers and other
digital electronic devices to increase their performance. Pipelining reduces the
cycle time of a processor and hence increases instruction throughput, the number of
instructions that can be executed in a unit of time. Pipelining does not help in all
cases, however, and has several associated disadvantages. An instruction pipeline is
said to be fully pipelined if it can accept a new instruction every clock cycle. A
pipeline that is not fully pipelined has wait cycles that delay the progress of the
pipeline.

Pipelining does not decrease the time for a single datum to be processed; it only
increases the throughput of the system when processing a stream of data. At the same
time, a pipelined system typically requires more resources (circuit elements,
processing units, computer memory, etc.) than one that executes one batch at a time,
because its stages cannot reuse the resources of a previous stage. Moreover,
pipelining may slightly increase the time it takes for an individual instruction to
finish, because of the registers added between stages.
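
The difference between latency and throughput can be illustrated with an idealized
cycle count. The Python sketch below assumes a stall-free pipeline, a non-pipelined
core that spends one clock cycle per stage on every instruction, and the same clock
period for both cores; these are simplifying assumptions for illustration, and the
function names are hypothetical.

```python
def nonpipelined_cycles(instructions: int, stages: int = 5) -> int:
    """Each instruction passes through all stages before the next one starts."""
    if instructions < 1 or stages < 1:
        raise ValueError("instructions and stages must be at least 1")
    return instructions * stages


def pipelined_cycles(instructions: int, stages: int = 5) -> int:
    """The first instruction fills the pipeline; each later one adds one cycle."""
    if instructions < 1 or stages < 1:
        raise ValueError("instructions and stages must be at least 1")
    return stages + (instructions - 1)


def ideal_speedup(instructions: int, stages: int = 5) -> float:
    """Cycle-count ratio; approaches the number of stages for long programs."""
    return (nonpipelined_cycles(instructions, stages)
            / pipelined_cycles(instructions, stages))


for n in (1, 10, 100):
    print(n, pipelined_cycles(n), round(ideal_speedup(n), 2))
```

A single instruction still needs five cycles in both cases, so its latency is
unchanged; only a stream of instructions benefits, and the ratio tends towards the
number of stages as the stream grows.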

One key aspect of pipeline design is balancing pipeline stages. Another

design consideration is the provision of adequate buffering between the pipeline

stages — especially when the processing times are irregular, or when data items may

be created or destroyed along the pipeline.

The advantage of pipelining is that the cycle time of the processor is reduced,

thus increasing instruction bandwidth in most cases. However, the advantages of not

pipelining include:

• The processor executes only a single instruction at a time. This prevents

branch delays (in effect, every branch is delayed) and problems with serial

instructions being executed concurrently. Consequently the design is simpler

and cheaper to manufacture.

• The instruction latency in a non-pipelined processor is slightly lower than in a

pipelined equivalent. This is due to the fact that extra flip flops must be added

to the data path of a pipelined processor.

• A non-pipelined processor will have a stable instruction bandwidth. The

performance of a pipelined processor is much harder to predict and may vary

more widely between different programs.

Many designs include pipelines as long as 7, 10 and even 31 stages (as in the Intel
Pentium 4). The Xelerator X10q has a pipeline more than a thousand stages long. The
downside of a long pipeline is that when a program branches, the entire pipeline
must be flushed, a problem that branch prediction helps to alleviate. Branch
prediction itself can end up exacerbating the problem if branches are predicted
poorly. In certain applications, such as supercomputing, programs are specially
written to rarely branch and so very long pipelines are ideal to speed up the
computations, as long pipelines are designed to reduce clocks per instruction (CPI).
Branching happens constantly, however, in many common applications such as office
software, significantly reducing the speed gain of pipelining.

The higher throughput of pipelines falls short when the executed code

contains many branches: the processor cannot know where to read the next

instruction, and must wait for the branch instruction to finish, leaving the pipeline

behind it empty. After the branch is resolved, the next instruction has to travel all the

way through the pipeline before its result becomes available and the processor

appears to "work" again. In the extreme case, the performance of a pipelined
processor could theoretically approach that of an un-pipelined processor, or even be
slightly worse if all but one of the pipeline stages are idle and a small overhead
is present between stages.
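
The effect of branches on throughput is commonly estimated with an effective CPI.
The Python sketch below uses the standard textbook approximation (base CPI plus
branch frequency times branch penalty); the branch fraction and penalty in the usage
lines are assumed values chosen for illustration, not figures measured on the DLX.

```python
def effective_cpi(branch_fraction: float, branch_penalty_cycles: float,
                  base_cpi: float = 1.0) -> float:
    """Average cycles per instruction when every branch costs extra cycles."""
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must be between 0 and 1")
    if branch_penalty_cycles < 0:
        raise ValueError("branch_penalty_cycles must not be negative")
    return base_cpi + branch_fraction * branch_penalty_cycles


def fraction_of_ideal_throughput(cpi: float, base_cpi: float = 1.0) -> float:
    """Throughput relative to a stall-free pipeline."""
    return base_cpi / cpi


# Assumed workload: 20% branches, each costing 2 extra cycles.
cpi = effective_cpi(branch_fraction=0.2, branch_penalty_cycles=2)
print(round(cpi, 2), round(fraction_of_ideal_throughput(cpi), 2))
```

With these assumed numbers the pipeline delivers roughly 70% of its ideal
throughput, which shows why branch-heavy code erodes the gain from pipelining.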

Because of the instruction pipeline, code that the processor loads will not

immediately execute. Due to this, updates in the code very near the current location

of execution may not take effect because they are already loaded into the Prefetch

Input Queue. Instruction caches make this phenomenon even worse. This is only

relevant to self-modifying programs.

Figure 2.5: Classic 5-stage instruction pipeline

CHAPTER 3

DLX PROCESSOR ARCHITECTURE

3.1 Overview

The DLX processor was first defined as a hypothetical RISC machine with a

simple 32-bit load/store architecture. It is well suited for teaching purposes because

of its simple instruction set, its single addressing mode, the simple decoding of its

instruction set and its easily understandable architecture. Nevertheless, this
architecture still demonstrates all the major features of the RISC principle.

3.2 External View of the DLX

The following subsections present the high-level overview of the DLX

processor architecture, its external interfaces, and its underlying submodules.

3.2.1 The DLX Interface

The external interface of the DLX has an address bus (ADDR) and a bidirectional data
bus (DATA), both 32 bits wide. The RW and ENABLE outputs and the READY input are
needed to handle memory access. The DLX also has an asynchronous reset input (RESET)
and a disable input (HALT). A two-phase, non-overlapping clock signal is expected at
the clock inputs PHI1 and PHI2. The ERROR output indicates that the processor has
reached an unrecoverable error state.

Figure 3.1: External interface of the DLX

3.2.2 Memory Interface and Access

A DLX word is 32 bits long. Memory is byte-addressable with a 32-bit address in
big-endian mode. All memory references are through loads and stores between memory
and the general-purpose registers. Accesses can be of byte, half-word or word
length. In the case of a half-word access, the address must be half-word aligned.
All instructions are 32 bits wide and must be word-aligned.

Figure 3.2: Byte positions in a DLX word (big-endian)
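
Big-endian byte ordering and the alignment rules can be checked with a short Python
sketch. The helper names are hypothetical, and the assumption that word data
accesses must also be word-aligned is an extrapolation from the rules stated above
for half-word accesses and instructions.

```python
import struct


def word_to_bytes_big_endian(word: int) -> bytes:
    """Lay out a 32-bit word in memory with the most significant byte first."""
    return struct.pack(">I", word & 0xFFFFFFFF)


def is_aligned(address: int, size: int) -> bool:
    """True if an access of `size` bytes (1, 2 or 4) is naturally aligned."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1 (byte), 2 (half-word) or 4 (word)")
    return address % size == 0


layout = word_to_bytes_big_endian(0x11223344)
print([hex(b) for b in layout])           # byte at the lowest address comes first
print(is_aligned(6, 2), is_aligned(6, 4))
```

In big-endian mode the byte at the lowest address holds the most significant eight
bits of the word, so 0x11 is stored first in this example.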

Addressing is only possible in two modes: immediate addressing and register-indexed
addressing. During read/write accesses, the four-bit ENABLE output and the RW output
determine the kind of bus transaction, as shown in Figure 3.3.

Figure 3.3: Memory Operation Modes

The timing of the bus read transaction is shown in Figure 3.4. During an idle state
(Ti), the processor places the memory address on the address bus (after the rising
edge of PHI2) to start the transaction. In the next state (T1), the processor
activates the RW and ENABLE lines and waits for the memory to access the data. If
the memory has completed its operation during this state (T1) or the following state
(T2), it asserts READY and the processor completes the transaction by resetting the
ENABLE lines (after the rising edge of PHI1) and continues with idle states.
Otherwise, the memory leaves the READY line false and the processor repeats T2
states until it detects READY to be true.

Figure 3.4: Memory Read Access
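
The read handshake can be modelled as a sequence of bus states. The Python sketch
below is a behavioural illustration only: it ignores the PHI1/PHI2 clock phases, the
function name and the `ready_state` parameter are hypothetical, and it assumes that
READY is examined once per state, starting with T1.

```python
from typing import List


def read_transaction_states(ready_state: int) -> List[str]:
    """Return the bus states visited during one read transaction.

    ready_state is the 1-based position of the first active state in which the
    memory asserts READY: 1 means during T1, 2 means during the first T2, and
    larger values add further T2 wait states.
    """
    if ready_state < 1:
        raise ValueError("ready_state must be at least 1")
    states = ["Ti"]                             # address placed on ADDR
    states.append("T1")                         # RW and ENABLE activated
    states.extend(["T2"] * (ready_state - 1))   # wait until READY is seen
    states.append("Ti")                         # ENABLE reset, back to idle
    return states


print(read_transaction_states(1))   # fast memory: no wait states
print(read_transaction_states(3))   # slow memory: two T2 states
```

Slower memories simply lengthen the run of T2 states, so the processor needs no
knowledge of the memory access time beyond the READY line.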

The timing of the bus write transaction is shown in Figure 3.5. During an idle state
(Ti), the processor places the memory address on the address bus to start the
transaction. In the next state (T1), the processor places the data on the data bus.
In the following T2 state, the processor activates the RW and ENABLE lines. If the
memory has completed its operation during this state, it asserts READY and the
processor completes the transaction by resetting the ENABLE lines and continues with
idle states. Otherwise, the memory leaves the READY line false and the processor
repeats T2 states until it detects READY to be true.

Figure 3.5: Memory Write Access

3.2.3 Reset

An activation of the RESET-input of the DLX changes the port direction of

the bi-directional data-bus DATA to input and the outputs ENABLE and RW are set

to zero. The registers affected by the reset are described in a later section.

3.2.4 Halt

Before fetching an instruction, the DLX processor checks the HALT input. If

the input is active, the processor changes to the inactive state. All output ports are

set to high impedance state and the DATA bus is switched to input. The processor

stays inactive until the HALT signal is set back to zero and continues afterwards with

the normal instruction fetch.

3.2.5 Error

In the case of error detection, the processor stops any further operation and
changes to the error state. This is indicated by the ERROR output being set high.
All output ports are set to the high impedance state and the DATA bus is switched to
input. The processor then has to be restarted or reset.

3.3 The DLX Programming Model

3.3.1 Accessible Registers

The following registers are programmer-accessible:

• R0, …, R31: 32 general purpose registers (GPRs), each 32 bits wide. The value of
R0 is always 0, i.e. the register can only be read and a write does not change its
content. Register R31 also serves as address storage when executing call
instructions.

• ICR: the interrupt control and exception register. The content of this register
can be moved to a GPR and vice versa.

• IAR: the interrupt address register. This register stores the address of the next
instruction when an interrupt is executed. The content of this register can be moved
to a GPR and vice versa.

• TBR: the trap base register. This register stores the base address of the
interrupt and exception handling routines in memory. The content of this register
can be moved to a GPR and vice versa.

Other registers are not accessible to the programmer. Among these registers are the
program counter (PC) and the instruction register (IR). The DLX has a load-store
architecture; that is, all arithmetic and logic operations are performed only
between GPRs, and memory is accessed only through loads and stores to and from those
registers.

3.3.2 The DLX Instruction Format

All DLX instructions are 32 bits wide with a 6-bit primary opcode. There are only
three different instruction formats: the J-type, the I-type and the R-type.

Figure 3.6: DLX instruction formats

I-type instructions are used to encode the load-store instructions with
immediate displacement. In this case, RS1 encodes the GPR which holds the
memory address, RD encodes the GPR that is loaded or stored, and the
16-bit sign-extended immediate is the displacement value. This
instruction format also encodes the conditional branch instructions.

R-type instructions encode the register-to-register ALU operations, where the
bit field func selects the ALU operation (RD ← RS1 func RS2), as well as the
register-indexed load/store instructions. In the latter case, RS1 encodes the
GPR which holds the memory address, RD encodes the GPR that is loaded or
stored, and RS2 encodes the register which holds the displacement value.

J-type instructions encode only the unconditional branches, in which the 26-bit
sign-extended immediate is added to the program counter (PC), and two special
instructions.
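As an illustration of the three formats, the following sketch splits a 32-bit word into its fields. It is a sketch under stated assumptions, not the decoder of the DLX core: bit 0 is taken as the most significant bit, the I-type boundaries (0-5, 6-10, 11-15, 16-31) follow the instruction breakdown used later in this report, and the R-type boundaries (Rs1, Rs2, Rd, 11-bit func) are assumed. The helper names `bits`, `sign_extend` and `decode` are hypothetical, and the format is supplied by the caller because the opcode-to-format mapping is not listed here.

```python
def bits(word, first, last):
    """Return bits first..last of a 32-bit word, numbering bit 0 as the MSB."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


def sign_extend(value, width):
    """Interpret a width-bit field as a two's-complement number."""
    sign_bit = 1 << (width - 1)
    return (value ^ sign_bit) - sign_bit


def decode(word, fmt):
    """Split a word into fields; fmt ('I', 'R' or 'J') is given by the caller."""
    fields = {"opcode": bits(word, 0, 5)}
    if fmt == "I":
        fields.update(rs1=bits(word, 6, 10), rd=bits(word, 11, 15),
                      imm=sign_extend(bits(word, 16, 31), 16))
    elif fmt == "R":
        # Assumed layout: Rs1, Rs2, Rd, then an 11-bit func field.
        fields.update(rs1=bits(word, 6, 10), rs2=bits(word, 11, 15),
                      rd=bits(word, 16, 20), func=bits(word, 21, 31))
    elif fmt == "J":
        fields.update(offset=sign_extend(bits(word, 6, 31), 26))
    else:
        raise ValueError("unknown format: " + fmt)
    return fields


# Example I-type word: opcode 010100, Rs1 = 7, Rd = 3, immediate 0x4400.
word = (0b010100 << 26) | (7 << 21) | (3 << 16) | 0x4400
print(decode(word, "I"))
```

Because the source register fields sit at the same bit positions in the I-type and R-type formats, the register file can be read before the format is known.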


3.3.3 The DLX instruction set

The DLX possesses 52 instructions which can be classified into:

• 18 arithmetic and logic instructions

• 12 test instructions

• 6 branch instructions

• 12 memory access instructions

• 4 special instructions

All load/store instructions exist in two formats: addressing with a 16-bit
immediate displacement and register-indexed addressing:

• LW Rd, Rs2(Rs1) (Rd ← Mem[Rs1 + Rs2])

• LW.I Rd, I(Rs1) (Rd ← Mem[Rs1 + I])

The arithmetic and logic operations, except LHI and NOP, are available in two
formats: register-to-register and register-immediate:

• ADD Rd, Rs1, Rs2 (Rd ← Rs1 + Rs2)

• ADD.I Rd, Rs1, I (Rd ← Rs1 + I)

The complete listing is shown in Tables 3.1, 3.2 and 3.3. The 16-bit
immediate value I is sign-extended for arithmetic instructions and
zero-extended for logical instructions. The LHI instruction writes a 16-bit
immediate value into the upper half-word of a GPR and sets the lower half-word
to zero.
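The difference between sign extension, zero extension and LHI can be made concrete with a short sketch. The function names below (`add_imm`, `or_imm`, `lhi`) are hypothetical and stand for one arithmetic, one logical and the load-high immediate operation; they model only the treatment of the 16-bit immediate, not the actual ALU of the DLX core.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm):
    """Copy bit 15 of the immediate into the upper 16 bits."""
    imm &= 0xFFFF
    return (imm | 0xFFFF0000) if imm & 0x8000 else imm


def zero_extend16(imm):
    """Fill the upper 16 bits with zeros."""
    return imm & 0xFFFF


def add_imm(rs1_value, imm):
    # Arithmetic instruction: the immediate is sign-extended.
    return (rs1_value + sign_extend16(imm)) & MASK32


def or_imm(rs1_value, imm):
    # Logical instruction: the immediate is zero-extended.
    return (rs1_value | zero_extend16(imm)) & MASK32


def lhi(imm):
    # Immediate goes to the upper half-word; the lower half-word is zero.
    return (imm & 0xFFFF) << 16


print(hex(add_imm(0x10, 0xFFFF)))  # 0xFFFF is treated as -1
print(hex(or_imm(0x10, 0xFFFF)))   # 0xFFFF is treated as 0x0000FFFF
print(hex(lhi(0x1234)))
```

Combining `lhi` with a logical immediate operation is the usual way to build a full 32-bit constant in a register from two 16-bit halves.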


Table 3.1: DLX arithmetic and logic instructions

There are 12 test instructions which test a relation between either the contents

of two GPRs or the content of one GPR and a 16-bit sign-extended immediate value.

If the result is true, the destination register is set to 1, otherwise it is set to 0.


The DLX possesses only two conditional branch instructions: branch on

equal to zero and branch on not equal to zero. Besides the two conditional branch

instructions, there are four unconditional branch instructions.

Table 3.2: DLX test and set instructions


Table 3.3: DLX branch instructions

The DLX also possesses four special instructions. Two of them move data
between the special registers (ICR, IAR or TBR) and the GPRs. The other two are
for interrupt and exception handling.

Table 3.4: DLX special instructions


The last class comprises the load-store instructions listed in Table 3.5.

Table 3.5: DLX load-store instructions

3.4 Internal Structure of the DLX

The internal structure of the DLX is quite simple and easy to comprehend. It
consists of two main components: the datapath and the controller. The datapath
executes all operations on data. It contains all the registers, the ALU, and
the internal data busses that interconnect them. The controller generates the
sequence of control signals which are necessary for the correct flow of data
in the datapath. The signals exchanged by the two main components are
separated into five classes.

Figure 3.7: Internal structure of the DLX

3.4.1 The Datapath

The structure of the non-pipelined DLX datapath is depicted in the following

figure. The general-purpose registers R0-R31 are contained in the register file. The

functions of ICR, IAR and TBR have already been mentioned. The PC holds the

address of the instruction which is to be executed next while the IR holds the current

instruction. The memory data register (MDR) contains the data to be written into the

memory in case of a write access or the data read from the memory in the case of a

read access. The memory address register (MAR) contains the address of the

concerned memory location. The MAR can also be used as a temporary register to

store intermediate results of a calculation.

The processor uses three internal busses: the source1 bus (S1), the source2

bus (S2) and the destination bus (Dest). The fundamental operation of the datapath is

reading operands from the register file, operating on them in the ALU, and then
writing the result back into the register file. Since the register file does
not need to be read and written every clock cycle, this sequence is broken
into multiple clock cycles to allow for shorter clock periods.

Figure 3.8: DLX datapath (non-pipelined)


The ALU can perform the following operations as denoted in Table 3.6:

Table 3.6: ALU operations

3.4.2 The Control Unit

The structure of the DLX control unit for the non-pipelined datapath is

depicted in the following figure. It consists of the central finite state machine (FSM),

the instruction register (IR), 2 instruction decoders and additional logic.

The FSM has 64 different states which change with the rising edge of phi1. It

generates seven groups of control signals which are transmitted to the datapath for its

operation: rs1_enable, rs2_enable, dest_enable, alu_ctrl, reg_file_ctrl, memory_ctrl

and various_ctrl. The signal groups rs1_enable and rs2_enable are composed of the

output enable signals for all registers which are connected to the S1 and S2 bus. The

dest_enable signals enable the load from the Dest bus for all registers which are

connected to it. The alu_ctrl signal selects the required ALU function, the

reg_file_ctrl signal controls the load from the C register into the register file and the

load from it into the A and B registers. The memory_ctrl signal is used for
controlling the memory operations. Finally, there are some additional control
signals which are grouped in various_ctrl.

The instruction decoder DEC1 decodes all the instructions except for the

memory instructions which are decoded using DEC2. The decoder DEC3 is used to

generate control signals for the generation of the register file addresses. The IR

contains the actual instruction, and it has two outputs connected to the S1 and S2

busses for the immediate values.

Figure 3.9: Structure of the DLX Control Unit

3.4.3 The Basic Execution Steps

Instructions in the DLX instruction set can be broken into five basic steps:
fetch, decode, execute, memory access, and write-back. This decomposition is
what makes pipelined instruction execution possible, although instructions
may also be executed in sequence, each one running to completion before the
next one starts. Each step may take one or several clock cycles.

1. Instruction fetch step:

IR ← Mem[PC];

Fetch the instruction from memory.

2. Instruction decode and operand fetch step:

A ← Rs1; B ← Rs2; PC ← PC + 4;

Decode the instruction. Access the register file to read the registers. This
can be done in parallel with the decoding because the source registers always
have the same location in the instruction formats (fixed-field decoding).
Thus the A and B registers are always loaded in this step, regardless of
whether their contents will be used afterwards. Increment the PC to point to
the next instruction.

3. Execution step:

a. Memory reference:

MAR ← A + (IR_16)^16 ## IR_16..31;
MDR ← Rd

The ALU adds the operands to form the effective address; for a store, the
MDR is loaded. Here (IR_16)^16 denotes bit 16 of the IR replicated 16 times
and ## denotes concatenation, i.e. the 16-bit immediate is sign-extended.

b. ALU instruction:

C ← A op B

The ALU performs the specified operation and the result is stored in C.

c. Branch/Jump:

Cond ← A op 0 (conditional branch instruction)
PC ← PC + (IR_6)^6 ## IR_6..31; (unconditional jump)

In the case of a conditional branch, the ALU performs a relational
operation. In the case of an unconditional jump, the ALU adds
the two operands to form the effective branch address, which is stored
in the PC. In the case of a jump-and-link instruction, the PC is saved
in the IAR before the jump is taken.

4. Memory access / branch completion step:

a. Memory reference:

MDR ← Mem[MAR]; C ← MDR (load instruction)
Mem[MAR] ← MDR; (store instruction)

b. Conditional branch:

If (cond) PC ← PC + (IR_16)^16 ## IR_16..31;

In the case of a conditional branch, add the two operands to form the
effective branch address and store the result in the PC if cond is true.

5. Write back step:

Rd ← C

Write the result into the register file.
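The five steps can be traced for a single instruction with a small behavioural sketch. It follows a load word with immediate displacement through the register transfers listed above; the function `run_load`, the dictionary used as data memory and the example values are hypothetical, and the instruction is passed in as already decoded fields rather than fetched as a word.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to 32 bits."""
    imm &= 0xFFFF
    return (imm | 0xFFFF0000) if imm & 0x8000 else imm


def run_load(pc, regs, mem, rs1, rd, imm):
    """Trace a load word with immediate displacement through the five steps."""
    # 1. Instruction fetch: IR <- Mem[PC] (decoded fields are passed in here).
    # 2. Decode and operand fetch: A <- Rs1; PC <- PC + 4.
    a = regs[rs1]
    pc = (pc + 4) & MASK32
    # 3. Execute: MAR <- A + sign-extended displacement.
    mar = (a + sign_extend16(imm)) & MASK32
    # 4. Memory access: MDR <- Mem[MAR]; C <- MDR.
    mdr = mem[mar]
    c = mdr
    # 5. Write back: Rd <- C (R0 stays zero).
    if rd != 0:
        regs[rd] = c
    return pc


regs = [0] * 32
regs[1] = 0x100
mem = {0xFC: 0xDEADBEEF}  # hypothetical data memory contents
next_pc = run_load(0, regs, mem, rs1=1, rd=2, imm=0xFFFC)  # displacement -4
print(hex(regs[2]), next_pc)
```

In the non-pipelined processor these steps run one after another for each instruction; the pipelined design in Chapter 5 overlaps them across consecutive instructions.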

CHAPTER 4

DESIGN WORKFLOW, METHODOLOGY AND TOOLS

4.1 Design Workflow

The first stage of the project is focused on literature review and study of the

DLX processor architecture. This is necessary in order to understand the VHDL

coding of the DLX processor.

[Figure 4.1 flowchart boxes: Study of DLX architecture; Study of pipelining concepts; Reproduce previous project results; Datapath redesign; Control unit redesign; Pipeline control unit design; Module simulations and validation; DLX modules integration; Full simulation and design validation; FPGA implementation and validation; Performance evaluation and analysis; Future work exploration]

Figure 4.1: Design Methodology Workflow

The next key ingredient of the project is pipelining. Hence, pipelining
concepts, including pipeline design considerations, limitations,
advantages and disadvantages, were studied. Pipelining in the context of the
DLX processor was also looked into.

Using the previous DLX processor work, several instructions were
re-simulated in the Quartus software to ascertain the functionality of the
previous design and to become familiar with the project’s source code. At this
stage, the source code was also examined further in order to draft the
redesigned pipelined datapath.

Subsequently, the next stage is the actual datapath and control unit redesign
to enable instruction pipelining. More details on the block diagram of the
pipelined datapath are documented in the next chapter of this report. This
stage also involves simulations within the Quartus tool to validate and verify
the functionality of all the instructions of the DLX.

Once all the submodules are designed, integration work is undertaken to
assemble the complete pipelined DLX processor. One initial aspiration of the
project, to implement the design on an FPGA, was not realized due to the time
constraints of this project. Therefore, all validation has only been carried
out up to functional timing simulation within Quartus.

The final steps involved analyzing the performance of the redesigned DLX
core with pipelining versus the non-pipelined starting point. At this stage,
limitations were also noted for future work recommendations.

4.2 Tools

4.2.1 Altera Quartus II 6.0 Web Edition

The Altera® Quartus® II design software provides a complete, multiplatform

design environment. It is a comprehensive environment for system-on-a-

programmable-chip (SOPC) design.

The free Quartus II Web Edition software includes everything needed to

design for Altera’s low-cost FPGA and CPLD families. Features include:

• Schematic- and text-based design entry

• Integrated VHDL, Verilog HDL, and SystemVerilog synthesis and support for

third-party synthesis software

• SOPC Builder system generation software

• Place-and-route, verification, and programming functions

• TimeQuest timing analyzer

• Timing optimization advisor

• Resource optimization advisor


Figure 4.2: Quartus II Design Flow

The Quartus® II design software delivers the highest productivity and

performance for FPGAs, CPLDs, and structured ASICs and offers numerous design

features to accelerate the design process:

• Incremental compilation to reduce the design cycle time

• SOPC Builder for system-level design

• MegaWizard® Plug-In Manager to quickly and easily integrate a broad

portfolio of intellectual property (IP) cores

• Power analysis tools to meet stringent power requirements

• A memory compiler function to easily use embedded memory

The Quartus II software supports VHDL and Verilog HDL design entry,

graphical-based design entry methods, and integrated system-level design tools. The

Quartus II software integrates design, synthesis, place-and-route, and verification

into a seamless environment, including interfaces to third-party EDA tools.


Quartus II integrated synthesis (QIS) supports SystemVerilog-2005, Verilog-

2001, Verilog-1995, VHDL 1993, and VHDL 1987 standards, and also supports

Altera AHDL and schematic (block design file) design entry.

QIS includes advanced synthesis options and compiler directives (attributes)

to guide the synthesis process to achieve optimal results. Included in these synthesis

options is the PowerPlay power analysis and optimization option and the multiplexer

option. The PowerPlay power optimization option controls how aggressively synthesis

optimizes the design for power. The multiplexer optimization option takes advantage

of Altera FPGA architectural features to reduce device area usage up to 20 percent to

fit designs into a smaller device and save cost.

4.2.2 VHDL

VHDL stands for Very-High-Speed-Integrated-Circuit Hardware Description

Language. VHDL is used in the reference DLX project for describing and coding the

DLX processor. VHDL can describe the behaviour and structure of electronic

systems, but is particularly suited as a language to describe the structure and

behaviour of digital electronic hardware designs, such as ASICs and FPGAs as well

as conventional digital circuits.

VHDL is a notation, and is precisely and completely defined by the Language

Reference Manual (LRM). This sets VHDL apart from other hardware description

languages, which are to some extent defined in an ad hoc way by the behaviour of

tools that use them. VHDL is an international standard, regulated by the IEEE. The

definition of the language is non-proprietary.

VHDL is not an information model, a database schema, a simulator, a toolset

or a methodology! However, a methodology and a toolset are essential for the

effective use of VHDL.


Simulation and synthesis are the two main kinds of tools which operate on the

VHDL language. The Language Reference Manual does not define a simulator, but

unambiguously defines what each simulator must do with each part of the language.

VHDL does not constrain the user to one style of description. VHDL allows

designs to be described using any methodology - top down, bottom up or middle out.

VHDL can be used to describe hardware at the gate level or in a more abstract way.


CHAPTER 5

PIPELINED DLX COMPONENT DESIGN

5.1 Pipelined DLX overview

The overall internal structure of the DLX remains as described in
Chapter 3: the two main components of the DLX are still the controller and
the datapath, as depicted in the diagram below.

Figure 5.1: DLX Internal Structural View

In the following pages, the detailed changes made inside the datapath and the

control unit are documented.

[Figure 5.1 block labels: Controller, Datapath. Signal labels: Clock, Reset, Memory, Address, Interrupt, Status, Control, Instruction, Ready, Enable, RW, Halt, IR.]

5.2 Pipelined Datapath

Figure 5.2 below showcases the high-level block diagram of the redesigned

DLX datapath to support pipelining.

Figure 5.2: DLX Pipelined Datapath

From the diagram, the most apparent change is the addition of multiplexers
at the inputs of the register file, PC and register C. These multiplexers are
needed to support the new pipelined nature of the datapath. The necessity of
these multiplexers will become clear when we look at the control unit
implementation and the steps executed concurrently in each pipestage, which
require the datapath to perform multiple operations at the same time. The next
section covers the micro-instructions that are executed in the datapath for
each different class of instruction.

[Figure 5.2 block labels: L1, L2, ALU, Register File, add_4, PC, PC1, LMDR, SMDR, IR, A, B, C, aluoutput, MAR, Controller. Signal labels: data_in, data_out, instr_addr, instr_in.]

5.2.1 Load/Store Instructions

During a Load instruction, the datapath performs the following three
operations, one in each of these pipestages:

• EXE: MAR ← A + immed

• MEM: LMDR ← MEM[MAR]

• WB: RD ← LMDR

Whereas during a Store instruction:

• EXE: MAR ← A + immed
       SMDR ← B

• MEM: MEM[MAR] ← SMDR

• WB: {idle}

To support this, the single memory data register that was sufficient in the
non-pipelined DLX processor is duplicated, so that there is one MDR for loads
(LMDR) and one for stores (SMDR).
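The need for two memory data registers can be seen from an idealized stage timetable. The sketch below assumes one instruction enters the pipeline per cycle and no stalls occur, which is a textbook simplification and not a claim about the measured timing of this design; the function `stage_in_cycle` and the two-instruction program are hypothetical.

```python
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by an instruction in a given cycle (ideal, no stalls)."""
    position = cycle - instr_index
    return STAGES[position] if 0 <= position < len(STAGES) else None


program = ["load", "store"]  # a store issued directly after a load
for cycle in range(len(program) + len(STAGES) - 1):
    occupancy = {name: stage_in_cycle(i, cycle) for i, name in enumerate(program)}
    print(cycle, occupancy)
```

In cycle 3 of this timetable the load is in MEM (writing LMDR) while the store is in EXE (writing SMDR), so a single shared MDR would be written by two instructions in the same cycle.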


5.2.2 Arithmetic/Logic Instructions

The following operations are concurrently run in the datapath during any

typical logic or arithmetic instruction execution:

• EXE: Aluout ← A op B

• MEM: C ← Aluout

• WB: Rd ← C

5.2.3 Test and Set Instructions

The following operations are concurrently run in the datapath during any test

and set instruction execution:

• EXE: Rs1 sub Rs2/immed
       Aluout ← '1' / '0'

• MEM: C ← Aluout

• WB: Rd ← C

5.2.4 Branch/Jump Instructions

The following operations are concurrently run in the datapath when a branch

or jump instruction is encountered:

• ID: PC1 ← PC

• EXE: MAR ← immed + PC1

• MEM: PC ← MAR (if cond)

• WB: {idle}

5.3 Redesigned Control Unit

The finite state machine (FSM) that controls the DLX is contained within the
pipe-control module, and dictates the processor's transitions between the
pipeline stages. The overall new control unit diagram is depicted below.

Figure 5.3: Control Unit for Pipelined DLX

The pipeline controller block contains the main FSM that determines the state

transitions of the DLX processor. The following pages will cover the function of

each block for IF, ID, EXE, MEM and WB.

[Figure 5.3 block labels: Pipeline Controller, IF, ID, EXE, MEM, WB, IR. Signal labels: instr_in, add_pc, en, clock, control signals.]

5.3.1 Instruction Fetch (IF)

The Instruction Fetch (IF) stage is responsible for fetching 32-bit long

instructions from memory. It also manages the program counter (PC) and does an

increment of the PC every time the ready signal is asserted.

Figure 5.4: Instruction Fetch (IF) Block Diagram

When the IF block receives a ready assertion from the memory, it checks whether
the pipeline controller has asserted stall_fetch. If not, it deasserts the
fetch_memory_not_ready signal to begin the instruction load from memory. At the
same time, pc_latch is asserted to increment the program counter and to update
the instruction register.
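The decision described above can be summarized as a small truth function. This is a sketch of the described behaviour only, not the VHDL of the IF block: the function name `if_stage` is hypothetical, and the outputs chosen for the cases where ready is low or stall_fetch is high (not-ready stays asserted, PC is not latched) are an assumption, since the text only describes the case in which the fetch proceeds.

```python
def if_stage(ready, stall_fetch):
    """Return (fetch_memory_not_ready, pc_latch) for one evaluation."""
    if ready and not stall_fetch:
        # Memory is ready and the pipeline controller allows the fetch.
        return (False, True)
    # Assumed behaviour otherwise: hold the fetch and keep the PC unchanged.
    return (True, False)


for ready in (False, True):
    for stall_fetch in (False, True):
        print(ready, stall_fetch, if_stage(ready, stall_fetch))
```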

[Figure 5.4 signal labels: clock, reset, stall_fetch, ready, fetch_memory_not_ready, pc_latch, fetch_mem_ctrl, ir_latch_en.]

5.3.2 Instruction Decode (ID)

The Instruction Decode stage is responsible for decoding the instruction in

the IR. Based on the type of instruction and its operands, it will fetch the values

from the registers or use the immediate values, and place them in register A and

register B for the subsequent stage. At the same time, it determines if the instruction

is a branch instruction, and will calculate the condition and target addresses.

Figure 5.5: Instruction Decode (ID) Block Diagram

The values for register A and register B are obtained using the rs1_out and
rs2_out signals. Before that is done, the ID block checks whether the stage is
stalled by checking the stall signal that arrives from the pipeline
controller block.

[Figure 5.5 signal labels: clock, stall, instr_in, reg_value, ir_1_latch, rs1_out, rs2_out.]

5.3.3 Execution Stage (EXE)

The execution (EXE) stage is very similar to the original control unit

from the non-pipelined DLX processor. Most of the control signals that go to the

datapath which are responsible for the correct operation of the datapath’s registers,

and ALU operations, originate from this block.

Figure 5.6: Execute (EXE) Block Diagram

Unique to the pipelined version of this block is an additional stall
signal at the input of the EXE module, which is controlled by the pipeline
controller. Its function is the same as in the ID block: to stall the
pipeline. Therefore, before any outputs of the EXE block are asserted, the
stall signal is checked.

[Figure 5.6 signal labels: clock, stall, reset, s1_enab, s2_enab, alu_op_sel, lmdr_latch, dest_enab, const_sel, immed_sel, exc_enab, icr_shifter, iar_mux, iar_latch, test_set_mux, smdr_latch, alu_neg, alu_zero.]

5.3.4 Memory Stage (MEM)

As its name suggests, the memory stage (MEM) is responsible for accessing
the data memory for load/store instructions. The ready signal is the assertion
received from the memory indicating that the data on the address bus is valid.
The output signals from the MEM block go to the pipeline controller as well as
the datapath, to control the memory data registers and their enable signals,
and to load the appropriate data into the registers.

Figure 5.7: Memory (MEM) Block Diagram

One can notice that there is no stall signal from the pipeline controller to this

stage and the subsequent write-back stage as well. This is because once the

execution (EXE) stage is reached, there is no need for any stalls in the pipeline, since

any need to stall the pipeline is only determined by the operation, which will be fully

decoded by the instruction-decode (ID) stage.

[Figure 5.7 signal labels: clock, reset, ready, mem_ctrl, entry_mux, pc_mux, c_latch, lmdr_latch, lmdr_ctrl, memory_not_ready, adr_ls2.]

5.3.5 Write Back (WB)

The write-back block is the simplest block in the control unit and is only
responsible for storing instruction results back into the destination
register, Rd. The rf_in signal enables the register file for writing, and the
content of register C is written into the register denoted by the rd signal.

Figure 5.8: Write-Back (WB) Block Diagram

[Figure 5.8 signal labels: clock, wb_mux, rd, rf_in.]

CHAPTER 6

RESULTS AND PERFORMANCE ANALYSIS

6.1 Overview

Once all the submodules in the new control unit were completed and

validated, these submodules were integrated to form the new control unit, and

subsequently integrated with the redesigned datapath to form the DLX processor.

Several functional simulations were carried out within the Quartus tool to verify

and validate the functionality of the new DLX with pipelining.

6.2 Functional Validation

A simple program was written and its output waveform was observed in the
Quartus waveform viewer after compilation, synthesis and simulation. The
following 3-instruction program was used:

addi r3,r7,0x4400

addi r15,r7,0x4444

xor r17,r3,r15

The instructions are then translated into machine code, as depicted in the
following instruction breakdown:

DLX Instruction : addi r3,r7,0x4400

Details : add immediate value 0x4400 to R7 and store in R3

Instruction format : I-type

Bit positions : 0-5      6-10    11-15   16-31
Field         : opcode   Rs1     Rd      Immediate
Value         : 010100   00111   00011   0100 0100 0000 0000

Machine Code : 0022C70A (big-endian)

DLX Instruction : addi r15,r7,0x4444

Details : add immediate value 0x4444 to R7 and store in R15

Instruction format : I-type

Machine Code : 2222F70A (big-endian)

DLX Instruction : xor r17,r3,r15

Details : exclusive-or contents of R15 and R3, and store in R17

Instruction format : R-type

Machine Code : 5011F600 (big-endian)
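The first breakdown above can be cross-checked with a short sketch. It packs the listed field values with bit 0 as the most significant bit and then reverses the bit order of the resulting word; the helper names `encode_i_type` and `reverse_bits32` are hypothetical. That the listed machine code corresponds to the bit-reversed word is an observation made by this check, not something stated in the original text.

```python
def encode_i_type(opcode, rs1, rd, imm):
    """Pack I-type fields: opcode 0-5, Rs1 6-10, Rd 11-15, immediate 16-31."""
    return ((opcode & 0x3F) << 26) | ((rs1 & 0x1F) << 21) \
        | ((rd & 0x1F) << 16) | (imm & 0xFFFF)


def reverse_bits32(word):
    """Reverse the order of the 32 bits of a word."""
    return int(format(word, "032b")[::-1], 2)


# addi r3,r7,0x4400 with the opcode bits 010100 from the breakdown
word = encode_i_type(0b010100, rs1=7, rd=3, imm=0x4400)
print(format(word, "08X"), format(reverse_bits32(word), "08X"))
```

The second value printed matches the machine code listed for the first instruction, which suggests that the hexadecimal codes in this chapter are written with bit 31 as the leftmost bit.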

These instructions are then fed into the DLX processor through the data bus
and the simulation output waveform is captured in Quartus as shown subsequently.

Figure 6.1: Simulation Waveform

From the timing simulation pattern, 0x0022C70A can be observed at the data

bus in (d_bus_in). The resulting immediate value 0x4400 can be seen at reg_c_in

which is the output register of the ALU and the input to the register file in the

datapath where the result will be written into R3.

The next add-immediate instruction is also observed on the data bus,

0x2222F70A, and its results can be observed at the reg_c_in value. Finally, the

exclusive-or instruction 0x5011F600 is placed on the data bus. When the xor

instruction is executed, reg_c_in shows the value of 0x0044 which is R3 ⊕ R15 =

0x4400 ⊕ 0x4444 = 0x0044.

From this simple simulation, it is concluded that the functional properties of
the DLX are intact and the operation results are accurate.

6.3 Gate Count and Frequency Statistics

Using the Quartus classic timing analyzer tool, the redesigned DLX processor

can be analyzed for post-synthesis gate count as well as maximum frequency

attainable, Fmax.

Figure 6.2: Gate Count Snapshot

Based on the Quartus fitter summary report, the new pipelined DLX utilizes
5096 logic elements, versus 4198 logic elements utilized by the non-pipelined
DLX processor. This represents a 21.39% increase in logic utilization with the
introduction of the redesigned datapath and control unit, which contains all
the new submodules for the pipeline stages.

Figure 6.3: Redesigned DLX Fmax

Looking at the Quartus classic timing analyzer report, the new DLX can
achieve a maximum clock frequency, Fmax, of 11.69 MHz. Compared to the original
non-pipelined DLX, which achieved an Fmax of 16.67 MHz, this represents a
28.78% slow-down in Fmax. This can likely be attributed to critical paths
within the new datapath or control unit that impede the speed of the
processor.

6.4 Instruction Execution Speed-Up

The true measure of the effectiveness of the new pipelined architecture of the
DLX is determined by the instruction speed-up attainable through the
introduction of the instruction pipeline. A comparison was made between the
non-pipelined and pipelined DLX running the exact same instructions.

Figure 6.4: Non-pipelined DLX execution waveform

Figure 6.5: Pipelined DLX execution waveform

Running the exact same 3 instructions, the pipelined DLX completes its
execution 3 clock cycles earlier than the non-pipelined DLX, accounting for a
25% instruction speed-up measured as the reduction in clock cycles (12 versus
9 clock cycles).

Based on the theory of pipelined microprocessors, the best-case instruction
speed-up attainable equals the number of pipestages in the design. Therefore,
hypothetically, the pipelined DLX can offer up to a 5-times performance
improvement in terms of clock cycle reduction, provided that the program
executed involves thousands of instructions and the percentage of branch
instructions in the program is kept to a minimum, since each branch
instruction causes a 3-cycle penalty to stall the pipeline and fetch the new
instruction.
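The best-case argument can be sketched numerically with the usual textbook model. The model is an assumption and not a measurement of this design: the non-pipelined machine is taken to need one cycle per stage for every instruction, the pipeline needs stages - 1 cycles to fill and then completes one instruction per cycle, and every branch adds a fixed penalty. The function names are hypothetical.

```python
def pipelined_cycles(n_instr, stages=5, branch_fraction=0.0, branch_penalty=3):
    """Cycles for n_instr instructions in an idealized pipeline."""
    fill = stages - 1
    return n_instr + fill + branch_fraction * n_instr * branch_penalty


def ideal_speedup(n_instr, stages=5, branch_fraction=0.0, branch_penalty=3):
    """Ratio of non-pipelined to pipelined cycle counts in this model."""
    unpipelined = n_instr * stages
    return unpipelined / pipelined_cycles(n_instr, stages,
                                          branch_fraction, branch_penalty)


for n in (3, 100, 10000):
    print(n, round(ideal_speedup(n), 3))
print("20% branches:", round(ideal_speedup(10000, branch_fraction=0.2), 3))
```

In this model the speed-up only approaches 5 for long programs, and a noticeable share of branches keeps it well below that bound, which is consistent with the conditions stated above.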

Table 6.1 summarizes the metrics of comparison between the non-pipelined

versus pipelined DLX processor:

Table 6.1: Non-pipelined versus Pipelined DLX

Metric                     Non-pipelined DLX   Pipelined DLX   Difference (%)
Gate count                 4198                5096            +21.39%
Maximum frequency, Fmax    16.67 MHz           11.67 MHz       -30.00%
Number of clock cycles     12 cycles           9 cycles        -25.00%
Speed-up                   1                   1.25            +25.00%
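The percentage differences can be recomputed from the raw figures with a few lines of arithmetic. The sketch below uses only the values listed in Table 6.1; the helper `percent_change` is hypothetical, and the execution-time estimate assumes that each design runs at its own Fmax, which was not measured in this project.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


logic_elements = (4198, 5096)   # non-pipelined, pipelined
fmax_mhz = (16.67, 11.67)       # values as listed in Table 6.1
cycles = (12, 9)

print("logic elements:", round(percent_change(*logic_elements), 2), "%")
print("clock cycles:  ", round(percent_change(*cycles), 2), "%")
print("cycle ratio:   ", round(cycles[0] / cycles[1], 2))

# Estimated run time in microseconds, assuming operation at Fmax.
time_us = [c / f for c, f in zip(cycles, fmax_mhz)]
print("run time (us): ", [round(t, 3) for t in time_us])
```

The 25% figure is the reduction in cycle count; expressed as a ratio of cycle counts, 12 to 9 corresponds to roughly 1.33. Taken at face value, the run-time estimate also suggests that for this 3-instruction program the cycle saving is offset by the lower Fmax, so longer programs would be needed to show a wall-clock benefit.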

CHAPTER 7

CONCLUSION AND FUTURE WORK RECOMMENDATIONS

7.1 Conclusion

The pipelined DLX processor was successfully designed and implemented
using VHDL, based loosely on the previous DLX project carried out at UTM. The
redesigned DLX processor was successfully simulated using Altera’s Quartus II
7.2 Web Edition software suite.

The pipelined DLX utilized a five-stage instruction pipeline (instruction

fetch, instruction decode, execute, memory and write-back) to operate on instructions

fed into the processor.

This project consisted of several phases of work. In the first part of the
project, much effort was spent on understanding the DLX architecture and source
code, as well as delving into pipelining concepts and design considerations.
The next stage involved successfully re-simulating and verifying the previous
non-pipelined DLX in Quartus. This was imperative to verify that the
functionality of the previous design could be reproduced, as well as to
strengthen knowledge of the DLX architecture and to become familiar with the
Quartus design tools.

The two main components of the DLX, the datapath and the control unit,
were redesigned to enable pipelined execution of instructions. Once the VHDL
coding of the submodules was completed, all the sub-blocks were integrated and
validated in Quartus. A lot of time and effort was spent iterating between
coding the blocks, integrating the design and validating the implementation
in Quartus.

Once the design was functionally validated, performance analysis work was

carried out, focusing on comparing the previous non-pipelined DLX versus the

redesigned pipelined-DLX processor.

The pipelined DLX showcased a 25% instruction execution speed-up, measured
by number of clock cycles, compared to the non-pipelined DLX. This came at a
cost of a 21.39% increase in logic element utilization as measured by the
Quartus tool. Finally, possible future work to further improve the performance
of the pipelined DLX processor was explored.

Throughout the design, implementation and validation stages of the project,
numerous hindrances were encountered, including a lack of VHDL coding
proficiency, unfamiliarity with the Quartus tool, and the handling of branch
instructions in the pipelined architecture. Each setback was handled
meticulously and diligently.

In summation, a wealth of knowledge was gained from this project which
could not have been taught in lessons. In-depth knowledge of computer
architecture, VHDL coding, pipelining concepts and implementation, and branch
handling in microprocessor design, as well as generic skills, were among the
expertise acquired through this project.

7.2 Recommendations for Future Work

Among the several possible future work recommendations presented here, the
most pivotal would be a full-scale implementation of the pipelined DLX on an
FPGA (and possibly a fabricated ASIC). With the design implemented on an FPGA,
the real-world performance of the pipelined DLX can be measured, particularly
in the presence of real external memory interactions. This will introduce the
need for better timing synchronization and handling between the DLX processor
and the external memory.

Another possible path to explore in further improving the performance of the
DLX processor would be the introduction of a branch-prediction algorithm and
hardware module. This can be realized through multiple fetches from memory per
load cycle and a sub-module that looks ahead two instructions to prepare
for branch instructions in the program. This can avoid the 3-cycle penalty paid
whenever an instruction in the pipeline is decoded as a branch, which results
in the entire pipeline being flushed.

The addition of a cache (either data cache or instruction cache) would also

significantly increase the performance of the DLX processor. In this case, an

instruction cache would be better suited to the pipelined architecture since

latency to fetch the instructions from the memory can be reduced to a minimum if

instructions are cached, regardless of whether the branch prediction unit is

implemented.

REFERENCES

1. Hennessy, John L. and Patterson, David A. (1990). Computer Architecture: A
Quantitative Approach. San Francisco, USA: Morgan Kaufmann.

2. Rajagopal, Selvakumar. FPGA Implementation of DLX Microprocessor with

Wishbone SoC Bus. Bachelor’s Thesis. Universiti Teknologi Malaysia; 2005

3. Amde, M., Blunno, I. and Sotiriou, C.P. (2003). Automating the Design of an
Asynchronous DLX Microprocessor. Proceedings of the 40th Design Automation
Conference (DAC), 2-6 June 2003, Page(s):502 - 507.

4. Gumm, Martin (1995). VHDL Modeling and Synthesis of the DLXS RISC

Processor. Germany: University of Stuttgart

5. Buhler, M. and Baitinger, U.G.(1998). VHDL-based development of a 32-bit

pipelined RISC processor for educational purposes, Ninth Mediterranean

Electrotechnical Conference (MELECON 98), Volume 1, 18-20 May 1998

Page(s):138 - 142 vol.1.

6. Ashenden, Peter J. (2002). The Designer’s Guide to VHDL, 2e, Morgan

Kaufmann, San Francisco.

APPENDIX A

VHDL SOURCE CODE FOR STRUCTURAL DLX CORE

-- dlx.vhd

library IEEE;

USE IEEE.std_logic_1164.ALL;

use WORK.dlx_instructions.all;

use WORK.control_types_2.all;

USE WORK.dlx_types.all;

entity dlxp is

port (

clock : in std_logic;

a_bus : out dlx_address;

d_bus_in : in dlx_word;

d_bus_out : out dlx_word;

enable : out dlx_nibble;

rw : out std_logic;

error : out std_logic;

ready : in std_logic;

reset : in std_logic; -- asynchronous reset

halt : in std_logic; -- freeze of processor state

intrpt : in dlx_nibble; -- interrupt signals (maskable)

pad_out_en : out std_logic; -- output pads enable (0 = out, 1 = tri)

pad_io_sw : out std_logic; -- io pads switch (0 = output, 1 = input)

---------------------------------------------------------------

instr_out : out dlx_word;

src_1 : out dlx_word;

src_2 : out dlx_word;

dst : out dlx_word;

rs1_out : out dlx_reg_addr;

rs2_out : out dlx_reg_addr;

rd_out : out dlx_reg_addr;

--

-- control outputs

--

s1_enab : out std_logic_vector(0 to 5); -- select s1 source

s2_enab : out std_logic_vector(0 to 3); -- select s2_source

dest_enab : out std_logic_vector(0 to 4); -- select destination

alu_op_sel : out std_logic_vector(0 to 3); -- alu operation

const_sel : out std_logic_vector(0 to 1); -- select const for s1

--rf_op_sel : out std_logic_vector(0 to 2); -- select reg file operation

immed_sel : out std_logic_vector(0 to 1); -- select immediate from ir

exc_enab : out std_logic_vector(0 to 8); -- enable set exception bit

mem_ctrl : out std_logic_vector(0 to 7); -- memory control lines

reg_c_in : out std_logic_vector(31 downto 0);

-- regf2out : out dlx_word;

lmdr_latch : out std_logic;

fetch_mem_ctrl : out std_logic

-- instr. reg. content

);

end dlxp;

--------------------------------------------------------------------------

-- Structural architecture of the datapath

--

-- file datapath-structural.vhd

--------------------------------------------------------------------------

architecture structural of datapath is

component bus_const32

port (

q1 : out dlx_word;

q2 : out dlx_word;

out_en1 : in std_logic;

out_en2 : in std_logic;

sel : in std_logic_vector(0 to 1));

end component;

component word_mux2

port (in0, in1 : in dlx_word;

y : out dlx_word;

sel : in std_logic);

end component;

component word_latch

port (

clock : in std_logic;

d : in dlx_word;

q : out dlx_word;

latch_en : in std_logic);

end component;

component word_reg_1e

port (

clock : in std_logic;

d : in dlx_word;

q : out dlx_word;

latch_en : in std_logic;

out_en : in std_logic);

end component;

component word_reg_1e1

port (

clock : in std_logic;

d : in dlx_word;

q1, q2 : out dlx_word bus;

latch_en : in std_logic;

out_en1 : in std_logic);

end component;

component mdr

port (

clock : in std_logic;

d : in dlx_word;

q1, q2 : out dlx_word;

latch_en : in std_logic;

out_en1 : in std_logic;

shift_ctrl : in std_logic_vector(0 to 2);

mar_ls2_in : in std_logic_vector(0 to 1));

end component;

component reg_file

port (

clock : in std_logic;

addr_out1 : in dlx_reg_addr;

q1 : out dlx_word;

addr_out2 : in dlx_reg_addr;

q2 : out dlx_word;

addr_in : in dlx_reg_addr;

d : in dlx_word;

write_en : in std_logic);

end component;

component icr

port (

clock : in std_logic;

d : in dlx_word; -- data in from dest_bus

latch_en : in std_logic; -- enable load from dest_bus

q : out dlx_word; -- output to s_bus

out_en : in std_logic; -- enable output to s_bus

--

s_en : in std_logic; -- set s bit

ioc_en : in std_logic; -- set ioc bit

irra_en : in std_logic; -- set irra bit

iav_en : in std_logic; -- set iav bit

dav_en : in std_logic; -- set dav bit

ovad_en : in std_logic; -- set ovad bit

ovar_en : in std_logic; -- set ovar bit

priv_en : in std_logic; -- set priv bit

super : out std_logic; -- supervisor bit

--

intrpt_in : in dlx_nibble; -- input from intrpt. port

intrpt_en : in std_logic; -- enable load from intrpt. port

intrpt : out std_logic); -- at least one masked interrupt active

end component;

component ir

port (

clock : in std_logic;

d : in dlx_word;

latch_en : in std_logic;

ir_out : out dlx_word;

immed_o1_en : in std_logic;

immed_out1 : out dlx_word;

immed_o2_en : in std_logic;

immed_out2 : out dlx_word;

immed_size : in std_logic; -- '0'-> 16 bit / '1'-> 26 bit

immed_sign : in std_logic); -- '0'-> unsigned / '1'-> signed

end component;

component alu

port (

clock : in std_logic;

s1 : in dlx_word;

s2 : in dlx_word;

latch_en : in std_logic;

result : out dlx_word;

func : in dlx_nibble;

zero : out std_logic;

negative : out std_logic;

overflow : out std_logic);

end component;

--

-- internal busses

--

signal s1_bus : dlx_word;

signal s2_bus : dlx_word;

signal dest_bus : dlx_word;

signal addr_mux_in0 : dlx_word;

signal addr_mux_in1 : dlx_word;

signal mdr_in : dlx_word;

signal reg_file_out1: dlx_word;

signal reg_file_out2: dlx_word;

signal reg_file_in : dlx_word;

--

-- other lines

--

signal intrn_alu_overflow : std_logic;

begin

dp_alu : alu

port map (clock => clock, s1 => s1_bus, s2 => s2_bus,

latch_en => alu_latch_en,

result => dest_bus, func => alu_func, zero => alu_zero,

negative => alu_negative, overflow => intrn_alu_overflow);

dp_reg_file : reg_file

port map (clock => clock, addr_out1 => reg_addr_rs1, q1 => reg_file_out1,

addr_out2 => reg_addr_rs2, q2 => reg_file_out2,

addr_in => reg_addr_rd, d => reg_file_in,

write_en => regf_wr_en);

a_reg : word_reg_1e

port map (clock => clock, d => reg_file_out1, q => s1_bus,

latch_en => a_latch_en, out_en => a_out_en);

b_reg : word_reg_1e

port map (clock => clock, d => reg_file_out2, q => s2_bus,

latch_en => b_latch_en, out_en => b_out_en);

c_reg : word_latch

port map (clock => clock, d => dest_bus, q => reg_file_in,

latch_en => c_latch_en);

pc_reg : word_reg_1e1

port map (clock => clock, d => dest_bus, q1 => s2_bus,

q2 => addr_mux_in0,

latch_en => pc_latch_en, out_en1 => pc_out_en);

instr_reg : ir

port map (clock => clock, d => data_in, latch_en => ir_latch_en,

ir_out => instr_out,

immed_o1_en => ir_immed_o1_en, immed_out1 => s1_bus,

immed_o2_en => ir_immed_o2_en, immed_out2 => s2_bus,

immed_size => ir_immed_size, immed_sign => ir_immed_sign);

icr_reg : icr

port map (clock => clock, d => dest_bus, q => s1_bus,

latch_en => icr_latch_en, out_en => icr_out_en,

s_en => icr_s_en, ioc_en => icr_ioc_en, irra_en => icr_irra_en,

iav_en => icr_iav_en, dav_en => icr_dav_en,

ovad_en => icr_ovad_en, ovar_en => icr_ovar_en,

priv_en => icr_priv_en,

super => icr_super, intrpt_in => icr_intrpt_in,

intrpt_en => icr_intrpt_en, intrpt => icr_intrpt);

iar_reg : word_reg_1e

port map (clock => clock, d => dest_bus, q => s1_bus,

latch_en => iar_latch_en, out_en => iar_out_en);

tbr_reg : word_reg_1e

port map (clock => clock, d => dest_bus, q => s1_bus,

latch_en => tbr_latch_en, out_en => tbr_out_en);

mar_reg : word_reg_1e1

port map (clock => clock, d => dest_bus, q1 => s2_bus,

q2 => addr_mux_in1,

latch_en => mar_latch_en, out_en1 => mar_out1_en);

addr_mux : word_mux2

port map (in0 => addr_mux_in0, in1 => addr_mux_in1, y => addr_out,

sel => addr_mux_sel);

mdr_reg : mdr

port map (clock => clock, d => mdr_in, q1 => s1_bus, q2 => data_out,

latch_en => mdr_latch_en, out_en1 => mdr_out1_en,

shift_ctrl => mdr_sh_ctrl, mar_ls2_in => addr_mux_in1(30 to 31));

mdr_mux : word_mux2

port map (in0 => dest_bus, in1 => data_in, y => mdr_in,

sel => mdr_mux_sel);

bus_const: bus_const32

port map ( q1 => s1_bus, out_en1 => const_o1_en,

q2 => s2_bus, out_en2 => const_o2_en,

sel => const_sel);

alu_overflow <= intrn_alu_overflow;

mar_adr_ls2 <= addr_mux_in1(30 to 31);

mar_adr_msb <= addr_mux_in1(0);

dest <= dest_bus ;

source_1 <= s1_bus;

source_2 <= s2_bus;

reg_c_in <= reg_file_in;

regf1out <= reg_file_out1;

regf2out <= reg_file_out2;

end structural;
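
The datapath above instantiates word_mux2 twice (addr_mux and mdr_mux), but the entity behind that component is not reproduced in this appendix. The following is a minimal sketch of one behaviour that would be consistent with the component's port list, not the original source. It assumes that dlx_word is a 32-bit std_logic_vector indexed 0 to 31 and that sel = '0' selects in0; the package mux_sketch_types is a hypothetical stand-in for dlx_types so that the sketch compiles on its own.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical stand-in for the dlx_types package, which is not listed here.
package mux_sketch_types is
  subtype dlx_word is std_logic_vector(0 to 31);  -- assumed width and direction
end package mux_sketch_types;

library IEEE;
use IEEE.std_logic_1164.all;
use WORK.mux_sketch_types.all;

-- Same port list as the word_mux2 component declared in the datapath.
entity word_mux2 is
  port (in0, in1 : in  dlx_word;
        y        : out dlx_word;
        sel      : in  std_logic);
end entity word_mux2;

architecture behaviour of word_mux2 is
begin
  -- sel = '0' passes in0; any other value passes in1 (assumed polarity).
  y <= in0 when sel = '0' else in1;
end architecture behaviour;
```

Under this assumed polarity, addr_mux would drive addr_out from the program counter path (addr_mux_in0) when addr_mux_sel is '0' and from the memory address register path (addr_mux_in1) otherwise.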
