VHDL IMPLEMENTATION OF PIPELINED DLX MICROPROCESSOR
IGNATIUS EDMOND ANTHONY
UNIVERSITI TEKNOLOGI MALAYSIA
For my beloved parents, sisters and friends, and not forgetting
my dearest partner-in-crime, Sheena.
ACKNOWLEDGEMENT
I would like to extend my sincerest gratitude and appreciation to anyone and
everyone who has contributed explicitly or implicitly towards the success of this
project entitled “VHDL Implementation of Pipelined DLX Microprocessor”.
Acknowledgment is particularly given to my project supervisor, Associate Professor
Muhammad Mun’im bin Ahmad Zabidi, who despite his tight schedule always
makes time to oversee the progress of this project besides offering advice on how to
proceed further whenever hindrances are encountered.
Finally, a big thank you to my family, who are always by my side offering me
moral support, and to Sheena, who was relentlessly there as a shoulder to lean on in
my darkest hours.
ABSTRACT
The 32-bit load/store DLX processor architecture is a generic RISC processor
designed by Hennessy and Patterson for pedagogical purposes. The DLX processor
design abstracts many features of general-purpose commercial processors, and is a
well-understood computer architecture, providing a good architectural model for
study, not only because of the popularity of this type of machine, but also because it
is easy to understand. Utilizing open source hardware such as the DLX core yields
the apparent advantage of free-for-all distribution, as well as source code that is
available and open, allowing for source code modification at will. This project
aims to continue previous work on integration of the DLX core by adding instruction
pipelining, which was excluded from the previous project’s scope due to complexity
and time limitations. Instruction execution speedup and performance were left
to be dealt with in future work. Since the DLX microprocessor is, by nature,
a 5-stage pipelined microprocessor, it can be expected that the core’s instruction
execution can be sped up with a pipeline implementation. A comparison
between the non-pipelined and pipelined DLX was also performed to verify this
expected instruction execution speedup.
ABSTRAK
The DLX processor architecture is a generic 32-bit RISC processor
designed by Hennessy and Patterson for research and educational
purposes. The DLX processor architecture encompasses many features and functions
of general-purpose commercial processors, and is not only a computer architecture
that is easy to understand, but also a very popular one. Using an open source
or open-core processor such as the DLX machine offers the advantage of openly
available source code, which permits and eases modification to suit
project needs. The aim of this project is to continue the previous project,
in which the DLX processor and the Wishbone interface were integrated, by
adding pipelining for instruction processing. With this feature added,
instruction processing time is expected to be shortened,
given that instruction processing in the DLX machine is carried out in five
stages. A performance comparison of the DLX processor before and after implementing
pipelining is also carried out in this project to verify this initial
expectation.
TABLE OF CONTENTS
CHAPTER TITLE
DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
LIST OF APPENDICES
1 PROJECT OVERVIEW
1.1 Background
1.2 Objectives
1.3 Scope of Work
1.4 Expected Results
1.5 Report Layout
2 LITERATURE REVIEW
2.1 Processor Selection Considerations for Systems-On-Chip
2.2 DLX Implementations
2.2.1 ASPIDA DLX Project
2.2.2 University of Stuttgart DLX Project
2.3 Pipelining Design Considerations
3 DLX PROCESSOR ARCHITECTURE
3.1 Overview
3.2 External View of the DLX
3.2.1 The DLX Interface
3.2.2 Memory Interface and Access
3.2.3 Reset
3.2.4 Halt
3.2.5 Error
3.3 The DLX Programming Model
3.3.1 Accessible Registers
3.3.2 The DLX Instruction Format
3.3.3 The DLX Instruction Set
3.4 Internal Structure of the DLX
3.4.1 The Datapath
3.4.2 The Control Unit
3.4.3 The Basic Execution Steps
4 DESIGN WORKFLOW, METHODOLOGY AND TOOLS
4.1 Design Workflow
4.2 Tools
4.2.1 Altera Quartus II 6.0 Web Edition
4.2.2 VHDL
5 PIPELINED DLX COMPONENT DESIGN
5.1 Pipelined DLX Overview
5.2 Pipelined Datapath
5.2.1 Load / Store Instruction
5.2.2 Arithmetic/Logic Instruction
5.2.3 Test and Set Instructions
5.2.4 Branch/Jump Instructions
5.3 Redesigned Control Unit
5.3.1 Instruction Fetch (IF)
5.3.2 Instruction Decode (ID)
5.3.3 Execution Stage (EXE)
5.3.4 Memory Stage (MEM)
5.3.5 Write Back (WB)
6 RESULTS AND PERFORMANCE ANALYSIS
6.1 Overview
6.2 Functional Validation
6.3 Gate Count and Frequency Statistics
6.4 Instruction Execution Speed-Up
7 CONCLUSION AND FUTURE WORK RECOMMENDATIONS
7.1 Conclusion
7.2 Recommendations for Future Work
REFERENCES
Appendix A
LIST OF TABLES
TABLE NO. TITLE
2.1 Open Cores Architecture Comparison
3.1 DLX arithmetic and logic instructions
3.2 DLX test and set instructions
3.3 DLX branch instructions
3.4 DLX special instructions
3.5 DLX load-store instructions
3.6 ALU Operations
6.1 Non-pipelined versus Pipelined DLX
LIST OF FIGURES
FIGURE NO. TITLE
2.1 DLX Programming Model
2.2 ASPIDA DLX Instruction Layout
2.3 Supported Integer Instructions
2.4 De-synchronized DLX datapath
2.5 Classic 5-stage instruction pipeline
3.1 External interface of the DLX
3.2 Byte positions in a DLX word (big-endian)
3.3 Memory Operation Modes
3.4 Memory Read Access
3.5 Memory Write Access
3.6 DLX instruction formats
3.7 Internal structure of the DLX
3.8 DLX datapath (non-pipelined)
3.9 Structure of the DLX Control Unit
4.1 Design Methodology Workflow
4.2 Quartus II Design Flow
5.1 DLX Internal Structural View
5.2 DLX Pipelined Datapath
5.3 Control Unit for Pipelined DLX
5.4 Instruction Fetch (IF) Block Diagram
5.5 Instruction Decode (ID) Block Diagram
5.6 Execute (EXE) Block Diagram
5.7 Memory (MEM) Block Diagram
5.8 Write-Back (WB) Block Diagram
6.1 Simulation Waveform
6.2 Gate Count Snapshot
6.3 Redesigned DLX Fmax
6.4 Non-pipelined DLX execution waveform
6.5 Pipelined DLX execution waveform
LIST OF ABBREVIATIONS
ALU - Arithmetic-Logic Unit
AMBA - Advanced Microcontroller Bus Architecture
ARM - Advanced RISC Machine
ASIC - Application-Specific Integrated Circuit
CLK - Clock
CPU - Central Processing Unit
CAM - Content Addressable Memory
EXE - Execution stage
FPGA - Field-Programmable Gate Array
FSM - Finite State Machine
GPR - General Purpose Register
IAR - Instruction Address Register
ICR - Interrupt Control Register
ID - Instruction-Decode
IF - Instruction-Fetch
ISA - Instruction Set Architecture
MAR - Memory Address Register
MDR - Memory Data Register
PC - Program Counter
RISC - Reduced Instruction Set Computing
RW - Read/Write
TBR - Trap Branch Register
VHDL - Very-High-Speed-Integrated-Circuit Hardware Description Language
VLSI - Very Large Scale Integration
WB - Write-Back
LIST OF APPENDICES
APPENDIX TITLE
A VHDL Source Code for Structural DLX Core
CHAPTER 1
PROJECT OVERVIEW
1.1 Background
The 32-bit load/store DLX processor architecture is a generic RISC processor
designed by Hennessy & Patterson for pedagogical purposes. The DLX processor
design abstracts many features of general-purpose commercial processors, and is a
well-understood computer architecture.
The DLX provides a good architectural model for study, not only because of
the popularity of this type of machine, but also because it is easy to understand. Like
most load/store machines, the DLX emphasizes a simple load/store instruction set,
design for pipelining efficiency, an easily decoded instruction set and efficiency as a
compiler target.
The Wishbone Bus is an open source hardware computer bus intended to let
the parts of an integrated circuit communicate with each other. The aim is to allow
the connection of differing cores to each other inside of a chip or system-on-chip.
Utilizing open source hardware such as the DLX core and Wishbone bus (or
its competitor, the AMBA bus) yields benefits which include solutions for most of
the problems associated with proprietary cores. Besides the apparent advantage of
free-for-all distribution, open source hardware standards and open
microprocessor cores offer the following:
• Each core will have a larger user base, which will ensure better support, better
documentation and better implementation examples to work from.
• The source is available, so any developer can find out what he or she needs to
know about the core.
• Eventually, as cores and standards for them are developed, cores will become
more standards-compliant than proprietary cores.
• Source code can be modified at will, which enables designers to fine-tune
and tweak any design for any design constraint: gate count, performance,
power, etc.
While previous work on integration of the DLX core and Wishbone bus
interface has been undertaken and completed, instruction pipelining was excluded
from the project’s scope due to complexity and time limitations. A DLX
microprocessor core with a non-pipelined instruction execution data path was used to
showcase the microprocessor core-Wishbone bus integration functionality on the
FPGA.
Since the previous project’s focus was on functionality, instruction execution
speedup and performance were left to be dealt with in future work. Since
the DLX microprocessor is, by nature, a five-stage pipelined microprocessor, it can
be expected that the core’s instruction execution can be sped up with
a pipeline implementation.
1.2 Objectives
The objective of this project is to design a five-stage pipelined DLX
microprocessor for the purpose of instruction execution speedup and performance
improvement.
As part of the project, the instruction execution speedup of
the enhanced DLX processor with pipelined instruction execution over the
non-pipelined DLX will be evaluated and analyzed using a predetermined test suite
consisting of several small programs.
Finally, exploration and investigation will be done on several design-for-test
(DFT) feature integrations into the DLX-Wishbone system for improved system
debug-ability as future work.
1.3 Scope of Work
This project is focused on the incremental enhancement of an existing DLX
processor design with Wishbone bus interface integration in UTM [2], to modify the
data path unit and controller unit for pipelined instruction execution in five stages:
Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access
(MEM) and Write-Back (WB).
The reference source code for this project is taken from the DLX project
by the University of Stuttgart, Germany, which implements a non-pipelined,
non-synthesizable flavour of the DLX processor.
Implementation of the DLX enhancements would involve coding in hardware
description language VHDL. Altera’s Quartus 6.1 Web Edition is the tool of choice
for design entry, logic synthesis (compilation) and simulation.
In the work flow of the project, the main emphasis would be on adding the
pipelining capability into the DLX processor’s datapath unit as well as any
incremental changes in the control unit to support the pipelined execution. The
Wishbone bus interface integration into the design will be used as-is from the
previous project with the expectation that any previous issues have been resolved.
While the eventual target implementation of the design would be in ASIC,
this project’s implementation level will only be restricted to functional and timing
simulation within Altera’s Quartus software.
As a final outcome, the implementation is validated and verified for
functional correctness through simulation. Performance analysis and evaluation
are carried out using a predetermined suite of small programs executed
on the integrated DLX system.
1.4 Expected Results
The pipelined DLX processor is expected to be successfully designed and simulated in
Altera’s Quartus software. The processor implementation is to be validated and verified
using an FFT computation task or simple sorting programs.
The enhanced pipelined DLX is expected to offer at least a 1.5X instruction execution
speedup over the non-pipelined core, measured by the CPU cycle time required to
complete a predefined list of computation tasks on both processors. This will serve as the
baseline expectation to justify that the additional logic overhead incurred by the
pipelined datapath translates to real-world performance speedup.
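The 1.5X expectation can be written as a simple ratio. The following is a minimal sketch, assuming the measurement is a count of clock cycles and that both cores run at the same clock frequency; the function names and the numbers in the usage note are illustrative only and are not results from this project.

```python
def speedup(cycles_non_pipelined: int, cycles_pipelined: int) -> float:
    """Return how many times faster the pipelined run is, by cycle count."""
    if cycles_non_pipelined <= 0 or cycles_pipelined <= 0:
        raise ValueError("cycle counts must be positive")
    return cycles_non_pipelined / cycles_pipelined


def meets_target(cycles_non_pipelined: int, cycles_pipelined: int,
                 target: float = 1.5) -> bool:
    """Check a measured pair of cycle counts against the 1.5X expectation."""
    return speedup(cycles_non_pipelined, cycles_pipelined) >= target
```

For example, a task that hypothetically needs 3000 cycles on the non-pipelined core and 2000 cycles on the pipelined core would sit exactly at the 1.5X baseline.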
1.5 Report Layout
The layout of this report is as follows:
Chapter 1: Brief overview of project, including objectives and work scope.
Chapter 2: Literature review of other existing DLX projects undertaken by
other universities, design considerations for pipelining, and
microprocessor selection considerations.
Chapter 3: Overview of the DLX processor and instruction set architecture,
pipelining concepts and design considerations.
Chapter 4: Design workflow, methodology and tools.
Chapter 5: Pipelined DLX microprocessor components and design.
Chapter 6: Results and performance analysis.
Chapter 7: Conclusion and future work recommendations.
CHAPTER 2
LITERATURE REVIEW
2.1 Processor Selection Considerations for Systems-On-Chip
In selecting the DLX as the baseline core for this project’s implementation
and processor redesign for pipelining support, several open-source processor
designs were evaluated against the following criteria:
• Processor and instruction set architecture complexity
• Availability of documentation
• Compiler availability
Table 2.1 compares the DLX, Leon and OpenRISC microprocessors.
The DLX microprocessor is chosen as the CPU for this project, as it was in the
previous project. The main motivations remain: it is free, and the
complexity of the DLX architecture is lower than that of the other two open source
processors. The Leon, for example, utilizes a windowed register architecture, whereas
the DLX is a simple load-store architecture with a flat set of 32 general purpose
registers.
Table 2.1: Open Cores Architecture Comparison
                                      DLX                        Leon                     OpenRISC
Windowed registers                    No                         Yes                      No
Number of general purpose registers   32                         40 to 520 (136 typical)  32 GPRs + many system control registers
Most similar to                       MIPS                       SPARC                    None
Complexity                            Low                        High                     High
Reference document                    Hennessy & Patterson text  IEEE 1754                Opencores.org
A windowed register architecture provides more than 32 general purpose
registers, in some designs as many as 256. At any time, however, only 32
registers are visible to the programmer. To use the other registers, the programmer
has to ‘slide’ a pointer up or down to select which predetermined window is visible.
This architecture is advantageous because a stack is not needed when
calling subroutines, as the information can be kept in registers. The
disadvantage is that more decoding circuitry is required to implement the windowing
function, which makes the design more complex than one using a non-windowed
register file.
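The sliding window described above can be modelled in a few lines. This is a toy sketch, not the Leon or SPARC register window scheme: the class name, the physical register count of 128 and the slide step of 16 are all assumptions chosen only to show how a window pointer maps 32 visible registers onto a larger physical file.

```python
class WindowedRegisterFile:
    """Toy model: only WINDOW registers are visible at any one time."""

    WINDOW = 32

    def __init__(self, physical: int = 128, step: int = 16):
        self.regs = [0] * physical  # all physical registers
        self.step = step            # how far one slide moves the window (assumed)
        self.base = 0               # pointer to the start of the visible window

    def _index(self, r: int) -> int:
        if not 0 <= r < self.WINDOW:
            raise IndexError("register not visible in the current window")
        return (self.base + r) % len(self.regs)

    def read(self, r: int) -> int:
        return self.regs[self._index(r)]

    def write(self, r: int, value: int) -> None:
        self.regs[self._index(r)] = value

    def slide(self, direction: int) -> None:
        """Move the window up (+1) or down (-1) by one step."""
        self.base = (self.base + direction * self.step) % len(self.regs)
```

In this model a caller can write a value into visible register 16, slide the window up, and the callee finds the same value in its visible register 0, which is how such a scheme can pass arguments without a stack.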
The DLX instruction set architecture is akin to the MIPS ISA in many ways,
consisting of a full set of general purpose programmer-accessible registers, as depicted
in Figure 2.1. Both employ a load-store architecture.
Figure 2.1: DLX Programming Model (registers shown: R0, R1, R2, ..., R31, PC, ICR, IAR, TBR)
The DLX also has several other registers, namely the TBR, ICR and IAR,
which will be dealt with in more detail in Chapter 3. One particular point of interest
is the absence of a dedicated status register in the DLX. Instead, all set and test
instructions store the result flag (1 for true, 0 for false) in a general purpose
destination register. More details will be discussed in Chapter 3, and all set/test
instructions are listed in Table 3.2. Only two conditional branch instructions exist in
the DLX. Coupled with this implementation of set and test instructions, they
are able to handle a variety of different branching conditions, indirectly reducing
pipeline stalls due to branch hits.
2.2 DLX Implementations
There are several open-source implementations of the DLX available today.
The distinguishing traits of these flavours of DLX implementation include
pipelined versus non-pipelined execution, synthesizable versus non-synthesizable
code, and synchronous versus asynchronous design. In this report, we delve
into two such DLX implementations, namely the ASPIDA DLX project and the
University of Stuttgart DLX project.
2.2.1 ASPIDA DLX Project
The ASPIDA open-source DLX supports the full DLX integer ISA. Floating
point operations are not supported in the current version of the processor. The
ASPIDA DLX contains two memory interfaces, following the original DLX model,
which support byte, half-word and word transfers. Branches follow the conventional
RISC semantics and require a branch delay slot, i.e. the instruction following the
branch is always executed. A vectored interrupt co-processor, including an interrupt
cause register and an exception program counter, is included.
This European Union-funded ASPIDA (ASynchronous oPen-source Ip of the
DLX Architecture) project has the goal of promoting the adoption of asynchronous
design, by delivering an open-source asynchronous synthesizable DLX processor
core, supporting the full integer Instruction Set Architecture, interrupts and byte
addressable memory. It will also deliver an asynchronous interconnect fabric based
on the CHAIN architecture, developed by the University of Manchester [3].
The ASPIDA DLX supports the three operation types of the DLX ISA:
• I-type: logic/arithmetic operations performed between a register and an
immediate value. Conditional branches are also I-type instructions.
• R-type: logic/arithmetic operations performed between two registers. Load
and store instructions are also R-type instructions.
• J-type: jump and jump-and-link instructions
The instruction layout for each instruction type is shown in Figure 2.2.
Figure 2.2: ASPIDA DLX Instruction Layout
The integer subset of the DLX ISA is shown in Figure 2.3, in which the
supported instructions are ticked.
Figure 2.3: Supported Integer Instructions
In this implementation, the DLX is de-synchronized. The global clock is
removed and replaced by handshaking controllers, and the flip-flops are replaced by
latch pairs. Figure 2.4 shows the datapath of the de-synchronized DLX. As
can be seen, the latches that separate the datapath stages are locally clocked by
controllers, which are responsible for producing the appropriate signals so that the
data move safely from one pipeline stage to the next.
Figure 2.4: De-synchronized DLX datapath
2.2.2 University of Stuttgart DLX Project
The University of Stuttgart DLX project is part of the VLSI Design Course
by Gumm [4], completed in December 1995 [5]. The DLX processor design abides
by Hennessy and Patterson’s original DLX RISC machine proposal.
In this project, only a subset of the original instruction set was implemented.
While the original DLX instruction set contained, among others, instructions for
signed and unsigned integer arithmetic and floating point arithmetic, support for
floating point arithmetic was not implemented in this design.
On the other hand, their processor model is extended with some features:
interrupt and exception handling, three different operation modes (supervisor, user,
and error) and one additional addressing mode. However, pipelining has not been
implemented. The interrupt and exception handling was derived from the DLXm
model from the Alliance design suite and, to a limited extent, from
the SPARC architecture. The memory addressing was also derived and simplified
from the SPARC definition. The timing models for the bus transactions have been
derived from the DP32 processor model.
The University of Stuttgart’s processor architecture was called the “DLXS”
where “S” stands for ‘Stuttgart’ version.
The Stuttgart processor design, DLXS, was selected for implementation in
the previous DLX project in UTM. Since DLXS was not synthesizable due to the
absence of a reference clock (processor control was done through control signals
from the test bench), the source code was modified to support a global reference
clock.
The DLXS external and internal interfaces are covered
in more detail in the following chapter on the DLX architecture.
2.3 Pipelining Design Considerations
In computing, a pipeline is a set of data processing elements connected in
series, so that the output of one element is the input of the next one. The elements of
a pipeline are often executed in parallel or in time-sliced fashion; in that case, some
amount of buffer storage is often inserted between elements.
An instruction pipeline is a technique used in the design of computers and
other digital electronic devices to increase their performance. Pipelining reduces
the cycle time of a processor and hence increases instruction throughput, the number of
instructions that can be executed in a unit of time. Pipelining does not help in all
cases, however, and has several disadvantages. An instruction pipeline is said to
be fully pipelined if it can accept a new instruction every clock cycle. A pipeline that
is not fully pipelined has wait cycles that delay the progress of the pipeline.
Pipelining does not decrease the time for a single datum to be processed; it
only increases the throughput of the system when processing a stream of data. At the
same time, a pipelined system typically requires more resources (circuit elements,
processing units, computer memory, etc.) than one that executes one batch at a time,
because its stages cannot reuse the resources of a previous stage. Moreover,
pipelining may increase the time it takes for an instruction to finish.
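The difference between latency and throughput can be made concrete with an idealized cycle count. The sketch below assumes a five-stage pipeline in which every stage takes exactly one clock cycle, the non-pipelined machine spends one cycle per stage on each instruction, and there are no stalls; the function names are illustrative.

```python
STAGES = 5  # IF, ID, EX, MEM, WB


def cycles_non_pipelined(n: int, stages: int = STAGES) -> int:
    """Each instruction passes through all stages before the next one starts."""
    return n * stages


def cycles_pipelined(n: int, stages: int = STAGES) -> int:
    """The first instruction fills the pipeline; then one completes per cycle."""
    return 0 if n == 0 else stages + (n - 1)


def latency_per_instruction(stages: int = STAGES) -> int:
    """A single instruction still needs every stage, pipelined or not."""
    return stages
```

Under these assumptions, 100 instructions take 500 cycles without pipelining and 104 cycles with it, while any single instruction still takes 5 cycles from fetch to write-back.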
One key aspect of pipeline design is balancing pipeline stages. Another
design consideration is the provision of adequate buffering between the pipeline
stages — especially when the processing times are irregular, or when data items may
be created or destroyed along the pipeline.
The advantage of pipelining is that the cycle time of the processor is reduced,
thus increasing instruction bandwidth in most cases. However, the advantages of not
pipelining include:
• The processor executes only a single instruction at a time. This prevents
branch delays (in effect, every branch is delayed) and problems with serial
instructions being executed concurrently. Consequently the design is simpler
and cheaper to manufacture.
• The instruction latency in a non-pipelined processor is slightly lower than in a
pipelined equivalent. This is due to the fact that extra flip flops must be added
to the data path of a pipelined processor.
• A non-pipelined processor will have a stable instruction bandwidth. The
performance of a pipelined processor is much harder to predict and may vary
more widely between different programs.
Many designs include pipelines as long as 7, 10 and even 31 stages (as in
the Intel Pentium 4). The Xelerator X10q has a pipeline more than a thousand stages
long. The downside of a long pipeline is that when a program branches, the entire
pipeline must be flushed, a problem that branch prediction helps to alleviate. Branch
prediction itself can end up exacerbating the problem if branches are predicted
poorly. In certain applications, such as supercomputing, programs are specially
written to rarely branch and so very long pipelines are ideal to speed up the
computations, as long pipelines are designed to reduce clocks per instruction (CPI).
Branching happens constantly, however, in many common applications such as
office software, significantly reducing the speed gain of pipelining.
The higher throughput of pipelines falls short when the executed code
contains many branches: the processor cannot know where to read the next
instruction, and must wait for the branch instruction to finish, leaving the pipeline
behind it empty. After the branch is resolved, the next instruction has to travel all the
way through the pipeline before its result becomes available and the processor
appears to "work" again. In the extreme case, the performance of a pipelined
processor could theoretically approach that of an un-pipelined processor, or even be
slightly worse if all but one of the pipeline stages are idle and a small overhead is
present between stages.
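The effect of branches on the cycle count can be sketched with a simple penalty model. This is an illustration under stated assumptions, not a measurement of any DLX implementation: every taken branch is assumed to cost a fixed number of refill cycles (by default the pipeline depth minus one), and the function name is hypothetical.

```python
def pipelined_cycles(n: int, taken_branches: int,
                     stages: int = 5, penalty=None) -> int:
    """Cycle count for n instructions when each taken branch flushes the pipeline."""
    if penalty is None:
        # Assumption: the stages behind the branch must all be refilled.
        penalty = stages - 1
    if n == 0:
        return 0
    return stages + (n - 1) + taken_branches * penalty
```

With 100 instructions and no taken branches the model gives 104 cycles; with 20 taken branches it gives 184 cycles; and if every instruction were a taken branch it gives 504 cycles, slightly more than the 500 cycles of a non-pipelined machine that spends 5 cycles per instruction.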
Because of the instruction pipeline, code that the processor loads will not
immediately execute. Due to this, updates in the code very near the current location
of execution may not take effect because they are already loaded into the Prefetch
Input Queue. Instruction caches make this phenomenon even worse. This is only
relevant to self-modifying programs.
Figure 2.5: Classic 5-stage instruction pipeline
CHAPTER 3
DLX PROCESSOR ARCHITECTURE
3.1 Overview
The DLX processor was first defined as a hypothetical RISC machine with a
simple 32-bit load/store architecture. It is well suited for teaching purposes because
of its simple instruction set, its single addressing mode, the simple decoding of its
instruction set and its easily understandable architecture. Nevertheless, this architecture
still demonstrates all the major features of the RISC principle.
3.2 External View of the DLX
The following subsections present the high-level overview of the DLX
processor architecture, its external interfaces, and its underlying submodules.
3.2.1 The DLX Interface
The external interface of the DLX has an address bus (ADDR) and a
bidirectional data bus (DATA), both 32 bits wide. The RW and ENABLE outputs
and the READY input are needed to handle memory access. The DLX also has an
asynchronous reset input (RESET) and a disable input (HALT). A two-phase,
non-overlapping clock signal is expected at the clock inputs PHI1 and PHI2. The
ERROR output indicates that the processor has reached an unrecoverable error state.
Figure 3.1: External interface of the DLX
3.2.2 Memory Interface and Access
A DLX word is 32 bits long. Memory is byte addressable with a 32-bit
address in big-endian mode. All memory references are through loads and stores
between memory and the general-purpose registers. Accesses can be of byte,
half-word or word length. In the case of a half-word access, the address must be
“half-aligned”. All instructions are 32 bits wide and must be word-aligned.
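The byte ordering and alignment rules can be illustrated with a small memory read helper. This is a sketch only: the function name is hypothetical, memory is modelled as a Python bytes object, and the requirement that word data accesses be word-aligned is an assumption extended from the stated rules for half-word accesses and instructions.

```python
def load(memory: bytes, addr: int, size: int) -> int:
    """Read a byte (1), half-word (2) or word (4) in big-endian order."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4 bytes")
    if addr % size != 0:
        # Half-words must be half-aligned; word alignment is assumed here.
        raise ValueError("misaligned access")
    if addr < 0 or addr + size > len(memory):
        raise ValueError("address out of range")
    # Big-endian: the byte at the lowest address is the most significant.
    return int.from_bytes(memory[addr:addr + size], "big")


MEM = bytes([0x12, 0x34, 0x56, 0x78])
```

Reading a word at address 0 of `MEM` yields 0x12345678, the half-word at address 2 yields 0x5678, and a half-word read at address 1 is rejected as misaligned.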
Figure 3.2: Byte positions in a DLX word (big-endian)
Addressing is only possible in two modes: immediate addressing and
register-indexed addressing. During read/write accesses, the four-bit ENABLE
output and the RW output determine the kind of bus transaction, as listed in
Figure 3.3.
Figure 3.3: Memory Operation Modes
The timing of a bus read transaction is shown in Figure 3.4. During an idle state
(Ti), the processor places the memory address on the address bus (after the rising
edge of PHI2) to start the transaction. In the next state (T1), the processor activates
the RW and ENABLE lines and waits for the memory to access the data. If the
memory has completed its operation during this state (T1) or the following state
(T2), it asserts READY and the processor completes the transaction by resetting the
ENABLE lines (after the rising edge of PHI1) and continues with idle states.
Otherwise, the memory leaves the READY line false and the processor repeats T2
states until it detects READY to be true.
Figure 3.4: Memory Read Access
The timing of a bus write transaction is shown in Figure 3.5. During an
idle state (Ti), the processor places the memory address on the address bus to start
the transaction. In the next state (T1), the processor places the data on the data bus.
In the following T2 state, the processor activates the RW and ENABLE lines. If the
memory has completed its operation during this state, it asserts READY and the
processor completes the transaction by resetting the ENABLE lines and continues
with idle states. Otherwise, the memory leaves the READY line false and the
processor repeats T2 states until it detects READY to be true.
Figure 3.5: Memory Write Access
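The read and write transactions described above can be summarized as a sequence of bus states. The sketch below is one reading of the text, not the DLXS timing model itself: it assumes that every transaction contains at least one T2 state in which READY is sampled, and `wait_states` (a hypothetical parameter) is the number of additional T2 states during which the memory keeps READY false.

```python
def bus_transaction(kind: str, wait_states: int = 0) -> list:
    """Return (state, activity) pairs for one read or write transaction."""
    if kind == "read":
        t1 = "RW and ENABLE active, memory accesses data"
        t2 = "wait for READY"
    elif kind == "write":
        t1 = "data placed on DATA bus"
        t2 = "RW and ENABLE active, wait for READY"
    else:
        raise ValueError("kind must be 'read' or 'write'")
    steps = [("Ti", "address placed on ADDR bus"), ("T1", t1)]
    # Assumption: one T2 state always occurs; slow memory adds more.
    steps += [("T2", t2)] * (1 + wait_states)
    return steps
```

A read with no wait states therefore passes through Ti, T1 and T2, while a write to a memory that needs two extra states passes through Ti, T1 and three T2 states before the bus returns to idle.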
3.2.3 Reset
An activation of the RESET-input of the DLX changes the port direction of
the bi-directional data-bus DATA to input and the outputs ENABLE and RW are set
to zero. The registers affected by the reset are described in a later section.
3.2.4 Halt
Before fetching an instruction, the DLX processor checks the HALT input. If
the input is active, the processor changes to the inactive state. All output ports are
set to high impedance state and the DATA bus is switched to input. The processor
stays inactive until the HALT signal is set back to zero and continues afterwards with
the normal instruction fetch.
3.2.5 Error
In the case of error detection, the processor stops any further operation and
changes to the error state. This is indicated by setting the ERROR output high.
All output ports are set to the high-impedance state and the DATA bus is switched to
input. The processor has to be restarted or reset.
3.3 The DLX Programming Model
3.3.1 Accessible Registers
The following registers are programmer-accessible:
• R0, …, R31: 32 general purpose registers (GPR), each 32 bits wide. The
value of R0 is always 0, i.e. the register can only be read and a write
does not change its content. Register R31 also serves as address
storage when executing call instructions.
• ICR: the interrupt control and exception register. The content of this
register can be moved to a general-purpose register and vice versa.
• IAR: the interrupt address register. This register stores the address of
the next instruction when an interrupt is executed. The content of this
register can be moved to a GPR and vice versa.
• TBR: the trap base register. This register stores the base address of the
interrupt and exception handling routines in the memory. The content
of this register can be moved to a GPR and vice versa.
Other registers are not accessible to the programmer. Among these registers
are the program counter (PC) and the instruction register (IR). The DLX has a load-
store architecture, that is all arithmetic and logic operations are limited to between
GPRs and the memory access is done via those registers as well.
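The behaviour of the general purpose registers, in particular the hard-wired zero in R0, can be modelled as follows. This is a minimal behavioural sketch with a hypothetical class name, not the register file of the VHDL design.

```python
class RegisterFile:
    """32 general purpose registers of 32 bits; R0 always reads as 0."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, n: int) -> int:
        return self._regs[n]

    def write(self, n: int, value: int) -> None:
        if not 0 <= n < 32:
            raise IndexError("register number out of range")
        if n != 0:  # a write to R0 does not change its content
            self._regs[n] = value & 0xFFFFFFFF  # keep 32 bits
```

Writing any value to register 0 leaves it at 0, while writes to R1 through R31 are truncated to 32 bits.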
3.3.2 The DLX instruction format
All DLX instructions are 32 bits wide with a 6-bit primary opcode. There are
only three different instruction formats: the J-type, the I-type and the R-type.
Figure 3.6: DLX instruction formats
I-type instructions are used to encode the load-store instructions with
immediate displacement. In this case, RS1 encodes the GPR which holds the
memory address, RD encodes the GPR to read from or write in respectively, and the
16-bit sign extended immediate is the displacement value. Furthermore, this
instruction format encodes the conditional branch instructions.
R-type instructions encode the register-to-register ALU operations, where the
bit field func encodes the ALU operation (RD ← RS1 func RS2), as well as the
register-indexed load/store instructions. In the latter case, RS1 encodes the GPR which
holds the memory address, RD encodes the GPR to read from or write to, and RS2
encodes the register which holds the displacement value.
J-type instructions encode only the unconditional branches, in which the 26-bit
sign-extended immediate is added to the program counter (PC), and two special
instructions.
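The three formats can be illustrated with a small field decoder. The field widths (6-bit opcode, 16-bit and 26-bit immediates) come from the text; the bit positions used below follow the classic Hennessy and Patterson DLX layout and are an assumption, since Figure 3.6 is not reproduced here and the DLXS encoding may differ. The function names and the opcode values in the usage note are arbitrary.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode(word: int, fmt: str) -> dict:
    """Split a 32-bit instruction word into fields (bit positions assumed)."""
    opcode = (word >> 26) & 0x3F  # 6-bit primary opcode
    if fmt == "I":
        return {"opcode": opcode, "rs1": (word >> 21) & 0x1F,
                "rd": (word >> 16) & 0x1F,
                "imm": sign_extend(word & 0xFFFF, 16)}
    if fmt == "R":
        return {"opcode": opcode, "rs1": (word >> 21) & 0x1F,
                "rs2": (word >> 16) & 0x1F, "rd": (word >> 11) & 0x1F,
                "func": word & 0x7FF}
    if fmt == "J":
        return {"opcode": opcode, "offset": sign_extend(word & 0x3FFFFFF, 26)}
    raise ValueError("fmt must be 'I', 'R' or 'J'")
```

For instance, an I-type word built with opcode 0x23, RS1 = 1, RD = 2 and immediate 0xFFFC decodes to a displacement of -4 under this assumed layout.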
3.3.3 The DLX instruction set
The DLX possesses 52 instructions which can be classified into:
• 18 arithmetic and logic instructions
• 12 test instructions
• 6 branch instructions
• 12 memory access instructions
• 4 special instructions
All load/store instructions exist in two formats: addressing with immediate
displacement (16-bit) and register indexed:
• LW Rd, Rs2 (Rs1) (Rd ← Mem[Rs1 + Rs2])
• LW.I Rd, I(Rs1) (Rd ← Mem[Rs1 + I])
The arithmetic and logic operations, except LHI and NOP, are available in two
formats: register-to-register and register-immediate:
• ADD Rd, Rs1, Rs2 (Rd ← Rs1 + Rs2)
• ADD.I Rd, Rs1, I (Rd ← Rs1 + I)
The complete listing is shown in Tables 3.1 to 3.5. The 16-bit
immediate value I is sign-extended for arithmetic instructions and
zero-extended for logical instructions. The LHI instruction writes a 16-bit
immediate value into the upper half-word of a GPR, while the lower half-word
is set to zero.
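The three treatments of the 16-bit immediate described above can be sketched in VHDL as follows. This is an illustrative fragment only: the entity name immed_extend and its ports are assumptions and do not appear in the DLX source.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical illustration; names are not taken from the thesis source.
entity immed_extend is
  port (
    immed    : in  std_logic_vector(15 downto 0);
    sign_ext : out std_logic_vector(31 downto 0);  -- arithmetic instructions
    zero_ext : out std_logic_vector(31 downto 0);  -- logical instructions
    lhi_val  : out std_logic_vector(31 downto 0)   -- value written by LHI
  );
end entity immed_extend;

architecture dataflow of immed_extend is
  constant ZERO16 : std_logic_vector(15 downto 0) := (others => '0');
begin
  -- resize on a signed value replicates the sign bit into the upper bits.
  sign_ext <= std_logic_vector(resize(signed(immed), 32));
  -- resize on an unsigned value pads the upper bits with zeros.
  zero_ext <= std_logic_vector(resize(unsigned(immed), 32));
  -- LHI: immediate in the upper half-word, lower half-word cleared.
  lhi_val  <= immed & ZERO16;
end architecture dataflow;
```

For example, an immediate of 0x8001 would give 0xFFFF8001 when sign-extended, 0x00008001 when zero-extended, and 0x80010000 for LHI.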
Table 3.1: DLX arithmetic and logic instructions
There are 12 test instructions which test a relation between either the contents
of two GPRs or the content of one GPR and a 16-bit sign-extended immediate value.
If the result is true, the destination register is set to 1, otherwise it is set to 0.
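The behaviour of a test instruction can be sketched with a single relation. Signed less-than is used here purely as a representative example; the relations actually supported are those listed in Table 3.2, and the entity name set_less_than is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical illustration of one test-and-set relation (signed a < b).
entity set_less_than is
  port (
    a  : in  std_logic_vector(31 downto 0);  -- first operand (a GPR)
    b  : in  std_logic_vector(31 downto 0);  -- second GPR or extended immediate
    rd : out std_logic_vector(31 downto 0)   -- 1 if the relation holds, else 0
  );
end entity set_less_than;

architecture dataflow of set_less_than is
  constant ONE  : std_logic_vector(31 downto 0) :=
    std_logic_vector(to_unsigned(1, 32));
  constant ZERO : std_logic_vector(31 downto 0) := (others => '0');
begin
  rd <= ONE when signed(a) < signed(b) else ZERO;
end architecture dataflow;
```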
The DLX possesses only two conditional branch instructions: branch on
equal to zero and branch on not equal to zero. Besides the two conditional branch
instructions, there are four unconditional branch instructions.
Table 3.2: DLX test and set instructions
Table 3.3: DLX branch instructions
The DLX also possesses four special instructions. Two encode moves
between the special registers (ICR, IAR and TBR) and the GPRs. The other two
are for interrupt and exception handling.
Table 3.4: DLX special instructions
The last class of instructions comprises the load-store instructions listed in Table 3.5.
Table 3.5: DLX load-store instructions
3.4 Internal Structure of the DLX
The internal structure of the DLX is simple and easy to comprehend. It
consists of two main components: the datapath and the controller. The datapath
executes all operations on data. It contains all the registers, the ALU, and the
internal data busses that interconnect them. The controller generates the
sequence of control signals necessary for the correct flow of data in the datapath.
The signals exchanged between the two main components are separated into five classes.
Figure 3.7: Internal structure of the DLX
3.4.1 The Datapath
The structure of the non-pipelined DLX datapath is depicted in Figure 3.8.
The general-purpose registers R0-R31 are contained in the register file. The
functions of ICR, IAR and TBR have already been mentioned. The PC holds the
address of the instruction which is to be executed next while the IR holds the current
instruction. The memory data register (MDR) contains the data to be written into
memory in the case of a write access or the data read from memory in the case of a
read access. The memory address register (MAR) contains the address of the
memory location concerned. The MAR can also be used as a temporary register to
store intermediate results of a calculation.
The processor uses three internal busses: the source1 bus (S1), the source2
bus (S2) and the destination bus (Dest). The fundamental operation of the datapath is
reading operands from the register file, operating on them in the ALU, and then writing
the result back into the register file. Since the register file does not need to be read
and written every clock cycle, this sequence is broken into multiple clock cycles to
allow for shorter clock periods.
Figure 3.8: DLX datapath (non-pipelined)
The ALU can perform the following operations as denoted in Table 3.6:
Table 3.6: ALU operations
3.4.2 The Control Unit
The structure of the DLX control unit for the non-pipelined datapath is
depicted in Figure 3.9. It consists of the central finite state machine (FSM),
the instruction register (IR), 2 instruction decoders and additional logic.
The FSM has 64 different states which change with the rising edge of phi1. It
generates seven groups of control signals which are transmitted to the datapath for its
operation: rs1_enable, rs2_enable, dest_enable, alu_ctrl, reg_file_ctrl, memory_ctrl
and various_ctrl. The signal groups rs1_enable and rs2_enable are composed of the
output enable signals for all registers which are connected to the S1 and S2 busses. The
dest_enable signals enable the load from the Dest bus for all registers which are
connected to it. The alu_ctrl signal selects the required ALU function. The
reg_file_ctrl signal controls the load from the C register into the register file and the
load from the register file into the A and B registers. The memory_ctrl signal is used for
controlling the memory operations. Finally, there are some additional control signals
which are grouped in various_ctrl.
The instruction decoder DEC1 decodes all the instructions except for the
memory instructions which are decoded using DEC2. The decoder DEC3 is used to
generate control signals for the generation of the register file addresses. The IR
contains the actual instruction, and it has two outputs connected to the S1 and S2
busses for the immediate values.
Figure 3.9: Structure of the DLX Control Unit
3.4.3 The Basic Execution Steps
Instructions in the DLX instruction set can be broken into five basic steps:
fetch, decode, execute, memory access, and write-back. This decomposition is what
makes it possible to pipeline instruction execution, although instructions
may also be executed in sequence, each one running to completion before the next
instruction starts. Each step may take one or several clock cycles.
1. Instruction fetch step:
IR ← Mem[PC];
Fetch the instruction from memory.
2. Instruction decode and operand fetch step:
A ← Rs1; B ← Rs2; PC ← PC + 4;
Decode the instruction and access the register file to read the registers. This
can be done in parallel with the decoding because the source registers
always occupy the same location in the instruction formats (fixed-field decoding).
Thus the A and B registers are always loaded in this step, regardless of whether
their contents will be used afterwards. Increment the PC to point to the next
instruction.
3. Execution step:
a. Memory reference:
MAR ← A + (IR[16])^16 ## IR[16:31];
MDR ← Rd
The ALU adds the operands to form the effective address, and the
MDR is loaded for a store. Here (IR[16])^16 denotes bit 16 of the IR
replicated 16 times and ## denotes concatenation, so the 16-bit
immediate is sign-extended to 32 bits.
b. ALU instruction:
C ← A op B
The ALU performs the specified operation and the result is stored in
C.
c. Branch/Jump:
Cond ← A op 0 (conditional branch instruction)
PC ← PC + (IR[6])^6 ## IR[6:31]; (unconditional jump)
In the case of a conditional branch, the ALU performs a comparison.
In the case of an unconditional jump, the ALU adds
the two operands to form the effective branch address, which is stored
in the PC. In the case of a jump-and-link instruction, the PC is saved
in the IAR before the jump is taken.
4. Memory access / branch completion step:
a. Memory reference:
MDR ← Mem[MAR]; C ← MDR (load instruction)
Mem[MAR] ← MDR; (store instruction)
b. Conditional branch:
If (cond) PC ← PC + (IR[16])^16 ## IR[16:31];
In the case of a conditional branch, add the two operands to form the
effective branch address and store the result in the PC if cond is true.
5. Write back step:
Rd ← C
Write the result into the register file.
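The five steps above can be sketched as a small state machine that walks through them in order. This is a simplification under stated assumptions: every step is given exactly one clock cycle (the text notes that a step may take several), reset is assumed to be asynchronous and active high, and the entity name step_sequencer and the 3-bit step codes are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical illustration of the non-pipelined step sequence.
entity step_sequencer is
  port (
    clock : in  std_logic;
    reset : in  std_logic;                     -- asynchronous, active high (assumed)
    step  : out std_logic_vector(2 downto 0)   -- code of the current step
  );
end entity step_sequencer;

architecture behaviour of step_sequencer is
  type step_type is (fetch, decode, execute, memory, write_back);
  signal current : step_type;
begin
  -- One step per clock cycle; the next instruction starts after write-back.
  process (clock, reset)
  begin
    if reset = '1' then
      current <= fetch;
    elsif rising_edge(clock) then
      case current is
        when fetch      => current <= decode;
        when decode     => current <= execute;
        when execute    => current <= memory;
        when memory     => current <= write_back;
        when write_back => current <= fetch;
      end case;
    end if;
  end process;

  with current select
    step <= "000" when fetch,
            "001" when decode,
            "010" when execute,
            "011" when memory,
            "100" when write_back;
end architecture behaviour;
```

In a pipelined design the five steps are instead active at the same time, each working on a different instruction.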
CHAPTER 4
DESIGN WORKFLOW, METHODOLOGY AND TOOLS
4.1 Design Workflow
The first stage of the project is focused on literature review and study of the
DLX processor architecture. This is necessary in order to understand the VHDL
coding of the DLX processor.
[Figure 4.1 stages: study of DLX architecture; study of pipelining concepts; reproduce previous project results; datapath redesign; control unit redesign; pipeline control unit design; module simulations and validation; DLX modules integration; full simulation and design validation; FPGA implementation and validation; performance evaluation and analysis; future work exploration]
Figure 4.1: Design Methodology Workflow
The next key ingredient of the project is pipelining. Hence, pipelining
concepts, including pipeline design considerations, limitations,
advantages and disadvantages, were studied. Pipelining in the context of the
DLX processor was also looked into.
Using the previous DLX processor work, several instructions were
re-simulated in the Quartus software to ascertain the functionality of the previous
design as well as to become familiar with the project's source code. At this stage,
the source code was also examined further in order to draft the redesigned
pipelined datapath.
The next stage is the actual datapath and control unit redesign
to enable instruction pipelining. More details on the block diagram of the pipelined
datapath are documented in the next chapter of this report. This stage also involves
simulations within the Quartus tool to validate and verify the functionality of all
the DLX instructions.
Once all the submodules are designed, integration work is undertaken to
complete the pipelined DLX processor. One
initial aspiration of the project, to implement the design on FPGA, was not realized
due to time constraints. Therefore, all validation has
only been carried out up to functional timing simulation within Quartus.
The final steps involved analyzing the performance of the redesigned DLX
core with pipelining versus the non-pipelined starting point. At this stage, limitations
are also noted for future work recommendations.
4.2 Tools
4.2.1 Altera Quartus II 6.0 Web Edition
The Altera® Quartus® II design software provides a complete, multiplatform
design environment. It is a comprehensive environment for system-on-a-
programmable-chip (SOPC) design.
The free Quartus II Web Edition software includes everything needed to
design for Altera’s low-cost FPGA and CPLD families. Features include:
• Schematic- and text-based design entry
• Integrated VHDL, Verilog HDL, and SystemVerilog synthesis and support for
third-party synthesis software
• SOPC Builder system generation software
• Place-and-route, verification, and programming functions
• TimeQuest timing analyzer
• Timing optimization advisor
• Resource optimization advisor
Figure 4.2: Quartus II Design Flow
The Quartus® II design software delivers the highest productivity and
performance for FPGAs, CPLDs, and structured ASICs and offers numerous design
features to accelerate the design process:
• Incremental compilation to reduce the design cycle time
• SOPC Builder for system-level design
• MegaWizard® Plug-In Manager to quickly and easily integrate a broad
portfolio of intellectual property (IP) cores
• Power analysis tools to meet stringent power requirements
• A memory compiler function to easily use embedded memory
The Quartus II software supports VHDL and Verilog HDL design entry,
graphical-based design entry methods, and integrated system-level design tools. The
Quartus II software integrates design, synthesis, place-and-route, and verification
into a seamless environment, including interfaces to third-party EDA tools.
Quartus II integrated synthesis (QIS) supports SystemVerilog-2005, Verilog-
2001, Verilog-1995, VHDL 1993, and VHDL 1987 standards, and also supports
Altera AHDL and schematic (block design file) design entry.
QIS includes advanced synthesis options and compiler directives (attributes)
to guide the synthesis process. Included in these synthesis
options are the PowerPlay power optimization option and the multiplexer
optimization option. The PowerPlay power optimization option controls how aggressively
synthesis optimizes the design for power. The multiplexer optimization option takes
advantage of Altera FPGA architectural features to reduce device area usage by up to
20 percent to fit designs into a smaller device and save cost.
4.2.2 VHDL
VHDL stands for Very-High-Speed-Integrated-Circuit Hardware Description
Language. VHDL is used in the reference DLX project for describing and coding the
DLX processor. VHDL can describe the behaviour and structure of electronic
systems, but is particularly suited as a language to describe the structure and
behaviour of digital electronic hardware designs, such as ASICs and FPGAs as well
as conventional digital circuits.
VHDL is a notation, and is precisely and completely defined by the Language
Reference Manual (LRM). This sets VHDL apart from other hardware description
languages, which are to some extent defined in an ad hoc way by the behaviour of
tools that use them. VHDL is an international standard, regulated by the IEEE. The
definition of the language is non-proprietary.
VHDL is not an information model, a database schema, a simulator, a toolset
or a methodology! However, a methodology and a toolset are essential for the
effective use of VHDL.
Simulation and synthesis are the two main kinds of tools which operate on the
VHDL language. The Language Reference Manual does not define a simulator, but
unambiguously defines what each simulator must do with each part of the language.
VHDL does not constrain the user to one style of description. VHDL allows
designs to be described using any methodology - top down, bottom up or middle out.
VHDL can be used to describe hardware at the gate level or in a more abstract way.
CHAPTER 5
PIPELINED DLX COMPONENT DESIGN
5.1 Pipelined DLX overview
The overall internal structure of the DLX remains as described in
Chapter 3: the two main components of the DLX are still the controller and
the datapath, as depicted in Figure 5.1.
Figure 5.1: DLX Internal Structural View
In the following pages, the detailed changes made inside the datapath and the
control unit are documented.
[Figure 5.1 blocks: Controller, Datapath. Signals: Clock, Reset, Memory, Address, Interrupt, Status, Control, Instruction, Ready, Enable, RW, Halt, IR]
5.2 Pipelined Datapath
Figure 5.2 shows the high-level block diagram of the DLX datapath
redesigned to support pipelining.
Figure 5.2: DLX Pipelined Datapath
[Figure 5.2 blocks and signals: L1, L2, ALU, Register File, add_4, PC, PC1, LMDR, SMDR, IR, A, B, C, aluoutput, MAR, Controller, data_in, data_out, instr_addr, instr_in]
From the diagram, the most apparent change is the
addition of multiplexers at the inputs of the register file, PC and register C. These
multiplexers are needed to support the pipelined nature of the datapath. Their
necessity will become clear when we look at the control unit
implementation and at the steps executed concurrently in each pipestage, which
require the datapath to perform multiple operations at the same time. The next
section covers the micro-instructions that are executed in the datapath for each
class of instruction.
5.2.1 Load/Store Instructions
During a Load instruction, the datapath performs the following three
operations in these pipestages:
• EXE: MAR ← A + immed
• MEM: LMDR ← MEM[MAR]
• WB: RD ← LMDR
Whereas during a Store instruction:
• EXE: MAR ← A + immed
SMDR ← B
• MEM: MEM[MAR] ← SMDR
• WB: {idle}
For this purpose, the single memory data register that was sufficient in the
non-pipelined DLX processor is duplicated, giving one MDR for loads (LMDR) and
one for stores (SMDR).
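A minimal sketch of the duplicated memory data registers is given below. The signal names lmdr_latch, smdr_latch and data_in follow the labels used in this chapter, but the entity mdr_pair, the port b_value, the rising-edge clocking and the active-high enables are assumptions made for illustration; this is not the mdr component from Appendix A.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical illustration; not the mdr component from Appendix A.
entity mdr_pair is
  port (
    clock      : in  std_logic;
    lmdr_latch : in  std_logic;                      -- load enable for LMDR (assumed active high)
    smdr_latch : in  std_logic;                      -- load enable for SMDR (assumed active high)
    data_in    : in  std_logic_vector(31 downto 0);  -- word read from memory
    b_value    : in  std_logic_vector(31 downto 0);  -- contents of register B
    lmdr       : out std_logic_vector(31 downto 0);  -- written to Rd in WB
    smdr       : out std_logic_vector(31 downto 0)   -- driven to memory in MEM
  );
end entity mdr_pair;

architecture rtl of mdr_pair is
begin
  process (clock)
  begin
    if rising_edge(clock) then
      if lmdr_latch = '1' then
        lmdr <= data_in;   -- MEM stage of a load: LMDR <- MEM[MAR]
      end if;
      if smdr_latch = '1' then
        smdr <= b_value;   -- EXE stage of a store: SMDR <- B
      end if;
    end if;
  end process;
end architecture rtl;
```

With two registers, a load in its MEM stage and a following store in its EXE stage can each latch their data in the same cycle without overwriting one another.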
5.2.2 Arithmetic/Logic Instructions
The following operations are concurrently run in the datapath during any
typical logic or arithmetic instruction execution:
• EXE: Aluout ← A op B
• MEM: C ← Aluout
• WB: Rd ← C
5.2.3 Test and Set Instructions
The following operations are concurrently run in the datapath during any test
and set instruction execution:
• EXE: Rs1 sub Rs2/immed
Aluout ← '1' / '0'
• MEM: C ← Aluout
• WB: Rd ← C
5.2.4 Branch/Jump Instructions
The following operations are concurrently run in the datapath when a branch
or jump instruction is encountered:
• ID: PC1 ← PC
• EXE: MAR ← immed + PC1
• MEM: PC ← MAR (if cond)
• WB: {idle}
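The choice between the sequential address and the branch target at the PC input can be sketched as follows. The names pc_latch and mar follow the labels in this chapter; the entity pc_update, the take_branch input (standing in for cond), the reset behaviour and the priority of the branch over the increment are assumptions for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical illustration of the PC input multiplexer and add_4 block.
entity pc_update is
  port (
    clock       : in  std_logic;
    reset       : in  std_logic;                      -- asynchronous, active high (assumed)
    pc_latch    : in  std_logic;                      -- advance to the next instruction
    take_branch : in  std_logic;                      -- cond is true in the MEM stage
    mar         : in  std_logic_vector(31 downto 0);  -- branch target (immed + PC1)
    pc          : out std_logic_vector(31 downto 0)
  );
end entity pc_update;

architecture rtl of pc_update is
  signal pc_reg : unsigned(31 downto 0);
begin
  process (clock, reset)
  begin
    if reset = '1' then
      pc_reg <= (others => '0');
    elsif rising_edge(clock) then
      if take_branch = '1' then
        pc_reg <= unsigned(mar);   -- PC <- MAR (if cond)
      elsif pc_latch = '1' then
        pc_reg <= pc_reg + 4;      -- PC <- PC + 4
      end if;
    end if;
  end process;

  pc <= std_logic_vector(pc_reg);
end architecture rtl;
```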
5.3 Redesigned Control Unit
The finite state machine (FSM) that controls the DLX is contained within the
pipe-control module, and dictates the processor's transitions between the pipeline
stages. The overall new control unit is depicted in Figure 5.3.
Figure 5.3: Control Unit for Pipelined DLX
The pipeline controller block contains the main FSM that determines the state
transitions of the DLX processor. The following pages will cover the function of
each block for IF, ID, EXE, MEM and WB.
[Figure 5.3 blocks: Pipeline Controller, IF, ID, EXE, MEM, WB, IR. Signals: instr_in, add_pc, en, clock, control signals]
5.3.1 Instruction Fetch (IF)
The Instruction Fetch (IF) stage is responsible for fetching 32-bit long
instructions from memory. It also manages the program counter (PC) and does an
increment of the PC every time the ready signal is asserted.
Figure 5.4: Instruction Fetch (IF) Block Diagram
When the IF block receives a ready assertion from the memory, it checks whether
the pipeline controller has asserted stall_fetch. If not, it deasserts the
fetch_memory_not_ready signal to begin the instruction load from memory. At the
same time, pc_latch is asserted to increment the program counter, and the
instruction register is latched.
[Figure 5.4 signals: clock, reset, stall_fetch, ready, fetch_memory_not_ready, pc_latch, fetch_mem_ctrl, ir_latch_en]
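The handshake described for the IF block can be sketched as purely combinational logic. The signal names are those listed for Figure 5.4, but the entity name if_handshake, the active-high polarity of every signal and the absence of any clocked state are assumptions; the actual IF module may well be sequential.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical, combinational illustration of the IF handshake.
entity if_handshake is
  port (
    ready                  : in  std_logic;  -- memory has responded
    stall_fetch            : in  std_logic;  -- pipeline controller requests a stall
    fetch_memory_not_ready : out std_logic;
    pc_latch               : out std_logic;
    ir_latch_en            : out std_logic
  );
end entity if_handshake;

architecture dataflow of if_handshake is
  signal advance : std_logic;
begin
  -- Fetch proceeds only when memory is ready and no stall is requested.
  advance <= ready and not stall_fetch;

  fetch_memory_not_ready <= not advance;
  pc_latch               <= advance;   -- increment the program counter
  ir_latch_en            <= advance;   -- load the fetched instruction into IR
end architecture dataflow;
```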
5.3.2 Instruction Decode (ID)
The Instruction Decode stage is responsible for decoding the instruction in
the IR. Based on the type of instruction and its operands, it will fetch the values
from the registers or use the immediate values, and place them in register A and
register B for the subsequent stage. At the same time, it determines if the instruction
is a branch instruction, and will calculate the condition and target addresses.
Figure 5.5: Instruction Decode (ID) Block Diagram
The values for register A and register B are selected using the rs1_out and
rs2_out signals. Before that is done, the block checks whether the stage is stalled
by examining the stall signal that arrives at the ID block from the pipeline
controller block.
[Figure 5.5 signals: clock, stall, instr_in, reg_value, ir_1_latch, rs2_out, rs1_out]
5.3.3 Execution Stage (EXE)
The execution (EXE) stage is very similar to the original control unit
of the non-pipelined DLX processor. Most of the control signals that go to the
datapath and are responsible for the correct operation of the datapath's registers
and ALU originate from this block.
Figure 5.6: Execute (EXE) Block Diagram
Unique to the pipelined version of this block, there is an additional stall
signal as an input to the EXE module, which is controlled by the pipeline controller.
Its function is the same as in the ID block; to stall the pipeline. Therefore, before any
assertions are made to the output of the EXE block, the stall signal is checked.
[Figure 5.6 signals: clock, stall, reset, s1_enab, s2_enab, alu_op_sel, lmdr_latch, dest_enab, const_sel, immed_sel, exc_enab, icr_shifter, iar_mux, iar_latch, test_set_mux, smdr_latch, alu_neg, alu_zero]
5.3.4 Memory Stage (MEM)
As its name suggests, the memory stage (MEM) is responsible for accessing
the data memory for load/store instructions. The ready signal is the assertion
received from the memory indicating that the data at the address on the address
bus is valid. The output signals from the MEM block go to the pipeline controller
as well as the datapath, to control the memory data registers and their enable
signals so that the appropriate data is loaded into the registers.
Figure 5.7: Memory (MEM) Block Diagram
Note that there is no stall signal from the pipeline controller to this
stage or to the subsequent write-back stage. This is because once the
execution (EXE) stage is reached, there is no need for any stalls in the pipeline:
whether the pipeline must stall is determined only by the operation, which has
already been fully decoded by the instruction-decode (ID) stage.
[Figure 5.7 signals: clock, reset, ready, mem_ctrl, entry_mux, pc_mux, c_latch, lmdr_latch, lmdr_ctrl, memory_not_ready, adr_ls2]
5.3.5 Write Back (WB)
The write-back block is the simplest block in the control unit and is only
responsible for storing results back into the destination register, Rd. The rf_in
signal enables the register file for writing and the content of register C is written
into the register denoted by the rd signal.
Figure 5.8: Write-Back (WB) Block Diagram
[Figure 5.8 signals: clock, wb_mux, rd, rf_in]
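The write-back action itself can be sketched as a clocked write into a register array. The names rf_in, rd and c follow the text above; the entity wb_write, the read port (rs, q), the rising-edge write and the active-high enable are assumptions added so that the fragment is self-contained, and it is not the reg_file component from Appendix A.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical illustration of the write-back step.
entity wb_write is
  port (
    clock : in  std_logic;
    rf_in : in  std_logic;                      -- register file write enable
    rd    : in  std_logic_vector(4 downto 0);   -- destination register number
    c     : in  std_logic_vector(31 downto 0);  -- contents of register C
    rs    : in  std_logic_vector(4 downto 0);   -- read address (assumed, for observation)
    q     : out std_logic_vector(31 downto 0)   -- read data (assumed, for observation)
  );
end entity wb_write;

architecture rtl of wb_write is
  type reg_array is array (0 to 31) of std_logic_vector(31 downto 0);
  signal regs : reg_array := (others => (others => '0'));
begin
  process (clock)
  begin
    if rising_edge(clock) then
      if rf_in = '1' then
        regs(to_integer(unsigned(rd))) <= c;  -- Rd <- C
      end if;
    end if;
  end process;

  q <= regs(to_integer(unsigned(rs)));
end architecture rtl;
```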
CHAPTER 6
RESULTS AND PERFORMANCE ANALYSIS
6.1 Overview
Once all the submodules in the new control unit were completed and
validated, these submodules were integrated to form the new control unit, and
subsequently integrated with the redesigned datapath to form the DLX processor.
Several functional simulations were carried out within the Quartus tool to verify
and validate the functionality of the new DLX with pipelining.
6.2 Functional Validation
A simple program was written and its output waveform was observed in the
Quartus waveform viewer after compilation, synthesis and simulation. The following
3-instruction program was used:
addi r3,r7,0x4400
addi r15,r7,0x4444
xor r17,r3,r15
The instructions are then translated into machine code, as depicted in the
following instruction breakdown:
DLX Instruction : addi r3,r7,0x4400
Details : add immediate value 0x4400 to R7 and store in R3
Instruction format : I-type
Bits:   0-5      6-10    11-15   16-31
Field:  opcode   Rs1     Rd      Immediate
Value:  010100   00111   00011   0100 0100 0000 0000
Machine Code : 0022C70A (big-endian)
DLX Instruction : addi r15,r7,0x4444
Details : add immediate value 0x4444 to R7 and store in R15
Instruction format : I-type
Machine Code : 2222F70A (big-endian)
DLX Instruction : xor r17,r3,r15
Details : exclusive-or contents of R15 and R3, and store in R17
Instruction format : R-type
Machine Code : 5011F600 (big-endian)
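The first encoding can be cross-checked with a small simulation-only sketch. It packs the I-type fields with bit 0 as the leftmost bit, as in the breakdown above, and then compares the result against the listed machine code. The comparison relies on an observation derived from the numbers rather than stated in the text: 0022C70A matches the field values when the 32 bits are written in reverse order (bit 31 first). The entity encode_check and both functions are hypothetical, and VHDL-93 or later is assumed for the hexadecimal literal.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical simulation-only check; not part of the thesis source.
entity encode_check is
end entity encode_check;

architecture sim of encode_check is
  subtype word is std_logic_vector(0 to 31);

  -- Pack an I-type instruction: opcode, Rs1, Rd, 16-bit immediate.
  function encode_itype (opcode : std_logic_vector(0 to 5);
                         rs1    : natural;
                         rd     : natural;
                         immed  : natural) return word is
  begin
    return opcode
         & std_logic_vector(to_unsigned(rs1, 5))
         & std_logic_vector(to_unsigned(rd, 5))
         & std_logic_vector(to_unsigned(immed, 16));
  end function encode_itype;

  -- Re-index the word so that bit 31 becomes the leftmost bit.
  function reverse_bits (w : word) return std_logic_vector is
    variable r : std_logic_vector(31 downto 0);
  begin
    for i in 0 to 31 loop
      r(i) := w(i);
    end loop;
    return r;
  end function reverse_bits;
begin
  process
    variable w : word;
  begin
    -- addi r3,r7,0x4400: opcode 010100, Rs1 = 7, Rd = 3
    w := encode_itype("010100", 7, 3, 16#4400#);
    assert reverse_bits(w) = x"0022C70A"
      report "addi r3,r7,0x4400 does not match the listed machine code"
      severity failure;
    wait;
  end process;
end architecture sim;
```

Read with bit 0 first, the same word is 0x50E34400, which makes the opcode, register and immediate fields directly visible in the hexadecimal digits.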
These instructions are then fed into the DLX processor through the data bus
and the simulation output waveform captured in Quartus is shown in Figure 6.1.
Figure 6.1: Simulation Waveform
From the timing simulation pattern, 0x0022C70A can be observed at the data
bus in (d_bus_in). The resulting immediate value 0x4400 can be seen at reg_c_in
which is the output register of the ALU and the input to the register file in the
datapath where the result will be written into R3.
The next add-immediate instruction, 0x2222F70A, is also observed on the data
bus, and its result can be observed at the reg_c_in value. Finally, the
exclusive-or instruction 0x5011F600 is placed on the data bus. When the xor
instruction is executed, reg_c_in shows the value 0x0044, which is R3 ⊕ R15 =
0x4400 ⊕ 0x4444 = 0x0044.
From this simple simulation, it is concluded that the functional properties of
the DLX are intact and the operation results are accurate.
6.3 Gate Count and Frequency Statistics
Using the Quartus classic timing analyzer tool, the redesigned DLX processor
can be analyzed for post-synthesis gate count as well as maximum frequency
attainable, Fmax.
Figure 6.2: Gate Count Snapshot
Based on the Quartus fitter summary report, the new pipelined DLX utilizes
5096 logic elements, versus 4198 logic elements utilized by the non-pipelined DLX
processor. This represents a 21.39% increase in logic utilization with the
introduction of the redesigned datapath and control unit, which contains all the new
submodules for the pipeline stages.
Figure 6.3: Redesigned DLX Fmax
Looking at the Quartus classic timing analyzer report, the new DLX can
achieve a maximum clock frequency, Fmax, of 11.69 MHz. The original
non-pipelined DLX achieved an Fmax of 16.67 MHz, so this represents a slow-down
in Fmax of roughly 30%. This can likely be attributed to
critical paths within the new datapath or control unit that limit the
speed of the processor.
6.4 Instruction Execution Speed-Up
The true measure of the effectiveness of the new pipelined architecture of the
DLX is the instruction speed-up attainable through the introduction of
the instruction pipeline. A comparison was made between the non-pipelined and
pipelined DLX running the exact same instructions.
Figure 6.4: Non-pipelined DLX execution waveform
Figure 6.5: Pipelined DLX execution waveform
Running the exact same 3 instructions, the pipelined DLX completes its
execution 3 clock cycles earlier than the non-pipelined DLX (9 versus 12 clock
cycles). This is a 25% reduction in clock cycles, or a speed-up of 12/9 ≈ 1.33.
For an ideal pipelined microprocessor, the best-case instruction speed-up
attainable equals the number of pipestages in
the design. Therefore, hypothetically, the pipelined DLX can offer up to a 5 times
performance improvement in terms of clock cycle reduction, if the program
executed involves thousands of instructions and the percentage of branch instructions
in the program is kept to a minimum, since each branch instruction causes a 3-cycle
penalty to stall the pipeline and fetch the new instruction.
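The limiting argument above can be written out under textbook assumptions that are not measurements from this project: each instruction takes k cycles on the non-pipelined machine, every pipestage takes one cycle, and there are no stalls or branch penalties. For n instructions, the non-pipelined machine then needs nk cycles, while the pipeline needs k cycles to fill and one further cycle for each remaining instruction:

```latex
S(n) = \frac{n\,k}{k + (n - 1)}, \qquad \lim_{n \to \infty} S(n) = k

% Example with k = 5 pipestages and n = 1000 instructions:
S(1000) = \frac{1000 \cdot 5}{5 + 999} = \frac{5000}{1004} \approx 4.98
```

Every stall cycle adds to the denominator, which is why the branch penalty keeps the speed-up of a real program below this bound.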
Table 6.1 summarizes the metrics of comparison between the non-pipelined
versus pipelined DLX processor:
Table 6.1: Non-pipelined versus Pipelined DLX
Metric                      Non-pipelined DLX   Pipelined DLX   Difference (%)
Gate Count                  4198                5096            +21.39
Maximum frequency, Fmax     16.67 MHz           11.67 MHz       -30.00
Number of clock cycles      12 cycles           9 cycles        -25.00
Speed-up (12/9)             1                   1.33            +33.33
CHAPTER 7
CONCLUSION AND FUTURE WORK RECOMMENDATIONS
7.1 Conclusion
The pipelined DLX processor was successfully designed and implemented
using VHDL, based loosely on the previous DLX project carried out in UTM. The
redesigned DLX processor was successfully simulated using Altera's Quartus II 7.2
Web Edition software suite.
The pipelined DLX utilizes a five-stage instruction pipeline (instruction
fetch, instruction decode, execute, memory and write-back) to operate on the
instructions fed into the processor.
This project consisted of several phases of work. In the first part of the
project, much effort was spent on understanding the DLX architecture and source
code, as well as delving into pipelining concepts and design considerations. The
next stage involved re-simulating and verifying the previous non-pipelined
DLX in Quartus. This was imperative to verify that the functionality of the
previous design could be reproduced, as well as to strengthen knowledge of the DLX
architecture and to become familiar with the Quartus design tools.
The two main components of the DLX, the datapath and the control unit,
were redesigned to enable pipelined execution of instructions. Once the VHDL
coding of the submodules was completed, all the sub-blocks were integrated and
validated in Quartus. A lot of time and effort was spent iterating between coding the
blocks, integrating the design and validating the implementation in Quartus.
Once the design was functionally validated, performance analysis work was
carried out, focusing on comparing the previous non-pipelined DLX versus the
redesigned pipelined-DLX processor.
The pipelined DLX showed a 25% reduction in the number of clock cycles
needed for instruction execution compared to the non-pipelined DLX. This
came at the cost of a 21.39% increase in logic element utilization as measured by the
Quartus tool. Finally, possible future work to further improve the performance
of the pipelined DLX processor was explored.
Throughout the design, implementation and validation stages of the project,
numerous hindrances were encountered, including a lack of VHDL coding
proficiency, unfamiliarity with the Quartus tool, and the handling of branch
instructions in the pipelined architecture. Each setback was handled meticulously
and diligently.
In summation, a wealth of knowledge was gained from this project which
could not have been taught in lessons. In-depth knowledge of computer architecture,
VHDL coding, pipelining concepts and implementation, and branch handling in
microprocessor design, as well as generic skills, were among the expertise acquired
through this project.
7.2 Recommendations for Future Work
Among the several possible future work recommendations presented here, the
most pivotal would be a full-scale implementation of the pipelined DLX on FPGA
(and possibly as a fabricated ASIC). With the design implemented on FPGA, the
real-world performance of the pipelined DLX can be measured, particularly in the
presence of real external memory interactions. This will introduce the need for better
timing synchronization and handling between the DLX processor and the external memory.
Another possible path to explore in further improving the performance of the
DLX processor would be the introduction of a branch-prediction algorithm and
hardware module. This can be realized through multiple fetches from memory per
load cycle and a sub-module that looks ahead two instructions to prepare
for branch instructions in the program. This could avoid the 3-cycle penalty paid
whenever an instruction in the pipeline is decoded as a branch, which currently
results in the entire pipeline being flushed.
The addition of a cache (either data cache or instruction cache) would also
be expected to increase the performance of the DLX processor. In this case, an
instruction cache would be better suited to the pipelined architecture, since the
latency to fetch instructions from memory can be reduced to a minimum if
instructions are cached, regardless of whether the branch prediction unit is
implemented.
REFERENCES
1. Hennessy, John L. and Patterson, David A. (1990). Computer Architecture: A
Quantitative Approach. San Francisco, USA: Morgan Kaufmann.
2. Rajagopal, Selvakumar (2005). FPGA Implementation of DLX Microprocessor with
Wishbone SoC Bus. Bachelor's Thesis. Universiti Teknologi Malaysia.
3. Amde, M., Blunno, I. and Sotiriou, C.P. (2003). Automating the Design of an
Asynchronous DLX Microprocessor. Proceedings of the 40th Design Automation
Conference (DAC), 2-6 June 2003, 502-507.
4. Gumm, Martin (1995). VHDL Modeling and Synthesis of the DLXS RISC
Processor. Germany: University of Stuttgart.
5. Buhler, M. and Baitinger, U.G. (1998). VHDL-based development of a 32-bit
pipelined RISC processor for educational purposes. Ninth Mediterranean
Electrotechnical Conference (MELECON 98), Volume 1, 18-20 May 1998,
138-142.
6. Ashenden, Peter J. (2002). The Designer's Guide to VHDL, 2nd ed. San Francisco:
Morgan Kaufmann.
APPENDIX A
VHDL SOURCE CODE FOR STRUCTURAL DLX CORE
-- dlx.vhd
library IEEE;
USE IEEE.std_logic_1164.ALL;
use WORK.dlx_instructions.all;
use WORK.control_types_2.all;
USE WORK.dlx_types.all;
entity dlxp is
port (
clock : in std_logic;
a_bus : out dlx_address;
d_bus_in : in dlx_word;
d_bus_out : out dlx_word;
enable : out dlx_nibble;
rw : out std_logic;
error : out std_logic;
ready : in std_logic;
reset : in std_logic; -- asynchronous reset
halt : in std_logic; -- freeze of processor state
intrpt : in dlx_nibble; -- interrupt signals (maskable)
pad_out_en : out std_logic; -- output pads enable (0 = out, 1 = tri)
pad_io_sw : out std_logic; -- io pads switch (0 = output, 1 = input)
---------------------------------------------------------------
instr_out : out dlx_word;
src_1 : out dlx_word;
src_2 : out dlx_word;
dst : out dlx_word;
rs1_out : out dlx_reg_addr;
rs2_out : out dlx_reg_addr;
rd_out : out dlx_reg_addr;
--
-- control outputs
--
s1_enab : out std_logic_vector(0 to 5); -- select s1 source
s2_enab : out std_logic_vector(0 to 3); -- select s2_source
dest_enab : out std_logic_vector(0 to 4); -- select destination
alu_op_sel : out std_logic_vector(0 to 3); -- alu operation
const_sel : out std_logic_vector(0 to 1); -- select const for s1
--rf_op_sel : out std_logic_vector(0 to 2); -- select reg file operation
immed_sel : out std_logic_vector(0 to 1); -- select immediate from ir
exc_enab : out std_logic_vector(0 to 8); -- enable set exception bit
mem_ctrl : out std_logic_vector(0 to 7); -- memory control lines
reg_c_in : out std_logic_vector(31 downto 0);
-- regf2out : out dlx_word;
lmdr_latch : out std_logic;
fetch_mem_ctrl : out std_logic
-- instr. reg. content
);
end dlxp;
--------------------------------------------------------------------------
-- Structural architecture of the datapath
--
-- file datapath-structural.vhd
--------------------------------------------------------------------------
architecture structural of datapath is
component bus_const32
port (
q1 : out dlx_word;
q2 : out dlx_word;
out_en1 : in std_logic;
out_en2 : in std_logic;
sel : in std_logic_vector(0 to 1));
end component;
component word_mux2
port (in0, in1 : in dlx_word;
y : out dlx_word;
sel : in std_logic);
end component;
component word_latch
port (
clock : in std_logic;
d : in dlx_word;
q : out dlx_word;
latch_en : in std_logic);
end component;
component word_reg_1e
port (
clock : in std_logic;
d : in dlx_word;
q : out dlx_word;
latch_en : in std_logic;
out_en : in std_logic);
end component;
component word_reg_1e1
port (
clock : in std_logic;
d : in dlx_word;
q1, q2 : out dlx_word bus;
latch_en : in std_logic;
out_en1 : in std_logic);
end component;
component mdr
port (
clock : in std_logic;
d : in dlx_word;
q1, q2 : out dlx_word;
latch_en : in std_logic;
out_en1 : in std_logic;
shift_ctrl : in std_logic_vector(0 to 2);
mar_ls2_in : in std_logic_vector(0 to 1));
end component;
component reg_file
port (
clock : in std_logic;
addr_out1 : in dlx_reg_addr;
q1 : out dlx_word;
addr_out2 : in dlx_reg_addr;
q2 : out dlx_word;
addr_in : in dlx_reg_addr;
d : in dlx_word;
write_en : in std_logic);
end component;
component icr
port (
clock : in std_logic;
d : in dlx_word; -- data in from dest_bus
latch_en : in std_logic; -- enable load from dest_bus
q : out dlx_word; -- output to s_bus
out_en : in std_logic; -- enable output to s_bus
--
s_en : in std_logic; -- set s bit
ioc_en : in std_logic; -- set ioc bit
irra_en : in std_logic; -- set irra bit
iav_en : in std_logic; -- set iav bit
dav_en : in std_logic; -- set dav bit
ovad_en : in std_logic; -- set ovad bit
ovar_en : in std_logic; -- set ovar bit
priv_en : in std_logic; -- set priv bit
super : out std_logic; -- supervisor bit
--
intrpt_in : in dlx_nibble; -- input from intrpt. port
intrpt_en : in std_logic; -- enable load from intrpt. port
intrpt : out std_logic); -- at least one masked interrupt active
end component;
component ir
port (
clock : in std_logic;
d : in dlx_word;
latch_en : in std_logic;
ir_out : out dlx_word;
immed_o1_en : in std_logic;
immed_out1 : out dlx_word;
immed_o2_en : in std_logic;
immed_out2 : out dlx_word;
immed_size : in std_logic; -- '0'-> 16 bit / '1'-> 26 bit
immed_sign : in std_logic); -- '0'-> unsigned / '1'-> signed
end component;
component alu
port (
clock : in std_logic;
s1 : in dlx_word;
s2 : in dlx_word;
latch_en : in std_logic;
result : out dlx_word;
func : in dlx_nibble;
zero : out std_logic;
negative : out std_logic;
overflow : out std_logic);
end component;
--
-- internal busses
--
signal s1_bus : dlx_word;
signal s2_bus : dlx_word;
signal dest_bus : dlx_word;
signal addr_mux_in0 : dlx_word;
signal addr_mux_in1 : dlx_word;
signal mdr_in : dlx_word;
signal reg_file_out1: dlx_word;
signal reg_file_out2: dlx_word;
signal reg_file_in : dlx_word;
--
-- other lines
--
signal intrn_alu_overflow : std_logic;
begin
dp_alu : alu
port map (clock => clock, s1 => s1_bus, s2 => s2_bus,
latch_en => alu_latch_en,
result => dest_bus, func => alu_func, zero => alu_zero,
negative => alu_negative, overflow => intrn_alu_overflow);
dp_reg_file : reg_file
port map (clock => clock, addr_out1 => reg_addr_rs1, q1 => reg_file_out1,
addr_out2 => reg_addr_rs2, q2 => reg_file_out2,
addr_in => reg_addr_rd, d => reg_file_in,
write_en => regf_wr_en);
a_reg : word_reg_1e
port map (clock => clock, d => reg_file_out1, q => s1_bus,
latch_en => a_latch_en, out_en => a_out_en);
b_reg : word_reg_1e
port map (clock => clock, d => reg_file_out2, q => s2_bus,
latch_en => b_latch_en, out_en => b_out_en);
c_reg : word_latch
port map (clock => clock, d => dest_bus, q => reg_file_in,
latch_en => c_latch_en);
pc_reg : word_reg_1e1
port map (clock => clock, d => dest_bus, q1 => s2_bus,
q2 => addr_mux_in0,
latch_en => pc_latch_en, out_en1 => pc_out_en);
instr_reg : ir
port map (clock => clock, d => data_in, latch_en => ir_latch_en,
ir_out => instr_out,
immed_o1_en => ir_immed_o1_en, immed_out1 => s1_bus,
immed_o2_en => ir_immed_o2_en, immed_out2 => s2_bus,
immed_size => ir_immed_size, immed_sign => ir_immed_sign);
icr_reg : icr
port map (clock => clock, d => dest_bus, q => s1_bus,
latch_en => icr_latch_en, out_en => icr_out_en,
s_en => icr_s_en, ioc_en => icr_ioc_en, irra_en => icr_irra_en,
iav_en => icr_iav_en, dav_en => icr_dav_en,
ovad_en => icr_ovad_en, ovar_en => icr_ovar_en,
priv_en => icr_priv_en,
super => icr_super, intrpt_in => icr_intrpt_in,
intrpt_en => icr_intrpt_en, intrpt => icr_intrpt);
iar_reg : word_reg_1e
port map (clock => clock, d => dest_bus, q => s1_bus,
latch_en => iar_latch_en, out_en => iar_out_en);
tbr_reg : word_reg_1e
port map (clock => clock, d => dest_bus, q => s1_bus,
latch_en => tbr_latch_en, out_en => tbr_out_en);
mar_reg : word_reg_1e1
port map (clock => clock, d => dest_bus, q1 => s2_bus,
q2 => addr_mux_in1,
latch_en => mar_latch_en, out_en1 => mar_out1_en);
addr_mux : word_mux2
port map (in0 => addr_mux_in0, in1 => addr_mux_in1, y => addr_out,
sel => addr_mux_sel);
mdr_reg : mdr
port map (clock => clock, d => mdr_in, q1 => s1_bus, q2 => data_out,
latch_en => mdr_latch_en, out_en1 => mdr_out1_en,
shift_ctrl => mdr_sh_ctrl, mar_ls2_in => addr_mux_in1(30 to 31));
mdr_mux : word_mux2
port map (in0 => dest_bus, in1 => data_in, y => mdr_in,
sel => mdr_mux_sel);
bus_const: bus_const32
port map ( q1 => s1_bus, out_en1 => const_o1_en,
q2 => s2_bus, out_en2 => const_o2_en,
sel => const_sel);
alu_overflow <= intrn_alu_overflow;
mar_adr_ls2 <= addr_mux_in1(30 to 31);
mar_adr_msb <= addr_mux_in1(0);
dest <= dest_bus ;
source_1 <= s1_bus;
source_2 <= s2_bus;
reg_c_in <= reg_file_in;
regf1out <=reg_file_out1;
regf2out <= reg_file_out2;
end structural;
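The structural datapath above instantiates word_mux2 twice (addr_mux selects between the PC and MAR outputs, mdr_mux between dest_bus and data_in), but this listing only declares it as a component. The following is a minimal sketch of what such a two-input word multiplexer could look like; it is not taken from the project sources. The architecture name and the use of std_logic_vector(0 to 31) as a stand-in for dlx_word are assumptions, the latter inferred from the slices addr_mux_in1(30 to 31) and addr_mux_in1(0) used above.

```vhdl
-- Hypothetical sketch of word_mux2; not the implementation used in this project.
library IEEE;
use IEEE.std_logic_1164.all;

entity word_mux2 is
  port (
    -- std_logic_vector(0 to 31) stands in for dlx_word (assumed 32 bits, ascending range)
    in0, in1 : in  std_logic_vector(0 to 31);
    y        : out std_logic_vector(0 to 31);
    sel      : in  std_logic);
end word_mux2;

architecture behaviour of word_mux2 is
begin
  -- sel = '0' passes in0 to y; any other value of sel passes in1
  y <= in0 when sel = '0' else in1;
end behaviour;
```

With this sketch, addr_mux would drive addr_out from the PC (in0) when addr_mux_sel is '0' and from the MAR (in1) otherwise, which matches the port map shown in the datapath.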