Rasmus Bo Sørensen, s072080
Jaspur Højgaard, s072069
PCIe to Parallel Interface
bridge in low-cost FPGA
Vol. 1
Bachelor's Thesis, June 2010
PCIe to Parallel Interface bridge in low-cost FPGA, Vol. 1
Author(s):
Rasmus Bo Sørensen, s072080
Jaspur Højgaard, s072069
Supervisor(s):
Docent, Jens Sparsø, DTU Informatics
Senior Hardware Manager, Thomas K. Jørgensen, Vitesse Semiconductor Corporation
Technical University of Denmark
Informatics and Mathematical Modelling
Building 321, DK-2800 Kgs. Lyngby, Denmark
Phone +45 45 25 33 51, Fax: +45 45 88 26 73
[email protected]
www.imm.dtu.dk
IMM-B.Sc.-2010-33
Release date:
25 June 2010
Class:
1 (public)
Edition:
1. edition
Comments:
This report is a part of the requirements to achieve Bachelor of Science in Engineering (BSc) at the Technical University of Denmark.
The report represents 20 ECTS points.
Rights:
© Vitesse Semiconductor Corporation, 2010
ABSTRACT
The objective of this project has been to design a bridge between a PCI Express
interface on one side and a parallel interface on a Vitesse switch chip on the other side.
In addition, we found it appropriate to enhance the bridge with a serial interface for
communication with peripheral components. The bridge should be designed for
implementation in a low-cost FPGA.
We have developed a bridge for communication between a PCI Express interface on
one side and a parallel interface and a serial interface on the other side. The developed
bridge is capable of interfacing with 3 different parallel interfaces and 4 different
serial interfaces. Given the requirements of the design, we have chosen the best-suited
FPGA for the implementation of the bridge, an EP4CGX15BF14C8 from Altera.
The implemented design has been tested through simulation using ModelSim.
The schematic for a PCB reference board has been made. Such an interface board can
be used for testing the bridge against Vitesse switch reference systems.
By making use of the split transaction capability of the PCI Express interface, we
prevent the serial and parallel interfaces from restricting each other's bandwidth.
RESUMÉ
The objective of this project has been to design a bridge between a PCI Express
interface on one side and a parallel interface, on a Vitesse switch chip, on the other
side. In addition, we have found it appropriate to extend the bridge with a serial
interface for communication with surrounding components. The bridge is to be designed
for implementation in a low-cost FPGA.
We have developed a bridge for communication between a PCI Express interface on one
side and a parallel interface and a serial interface on the other side. The developed
bridge can communicate with 3 different parallel interfaces and 4 different serial
interfaces. With the given requirements for the design, we have chosen the best FPGA
for the implementation of the bridge, an EP4CGX15BF14C8 from Altera. The implemented
design has been tested through simulation with ModelSim. We have furthermore made a
schematic for a PCB reference board. Such an interface board can be used to test the
bridge together with Vitesse switch reference systems. By making use of the split
transaction capability built into the PCI Express interface, we prevent the serial and
parallel interfaces from limiting each other's bandwidth.
PREFACE
This report is the result of a bachelor project at the Technical University of Denmark,
carried out in cooperation with Vitesse Semiconductor Corporation. During the project,
we have had supervisors from both places. We are two students who both wanted to carry
out our bachelor project in cooperation with a company, to gain experience outside the
university and to get a feeling for a real-life engineering problem. Among the possible
project descriptions we received from Vitesse, we chose this project because we thought
it was an interesting combination of designing a digital system and fitting it to the
given standards. The project description from Vitesse is seen below.
PCIe to parallel interface bridge in low-cost FPGA
This project is to develop a PCIe bridge for the Vitesse Ethernet Switches and MACs,
so that they can be connected to CPU systems that have a PCI Express interface.
The project can also include design and layout of a small interface board that holds
the FPGA.
The project should contain the following activities:
- Data gathering: understanding the requirements for the PCIe bridge
- Write up a requirement specification
- Research the market for small FPGAs to find the lowest-cost device that meets the requirements
- Development of the VHDL/Verilog code for the device
- Verify the code on a testbench
- Test and debug of the system
- Write report and application note
When doing a commercial project with results that can be delivered to customers,
the documentation is especially important. As part of this project we have written an
application note based on a Vitesse template; the application note is included in the
appendix. The application note is an important part of this project because it contains
details that we have learned through experience and that are required to make the
device work properly.
Because Vitesse is an American company, the template is of course in letter format and
may look a bit odd in A4 format.
While doing this project, we have had an office at Vitesse in Herlev, where we have
worked most of the time. The peripheral components relevant to our project have been
under development during the project, and so have their datasheets. To get the
information we needed, we have spoken directly to the developers of the chips and their
reference systems. During our stay at Vitesse we have participated in the weekly
meetings in our department. These meetings have helped us get much of the information
we needed. Speaking to the developers gave us all the information we needed, but it has
complicated the documentation because of the lack of references. Much of the
documentation we have is in the form of handwritten notes on post-its.
During the time we have been at Vitesse, we have participated in the DSE messe at
DTU as Vitesse representatives, and held a presentation about our project and how it
was to carry out a project at Vitesse, for students from Ingeniør Højskolen København.
Two students have worked on this project. The initials in the heading of each section
specify who has written it:
Rasmus Bo Sørensen – (RBS)
Jaspur Højgaard – (JH)
The initials specified in a heading apply to that section and all of its subsections.
In this report, references are written in the following format:
[Section name, Reference number] or [Reference number].
In the submitted report, a CD with the source code, the test files, the schematics and the
report itself has been included.
It has been a great experience to work with the engineers at Vitesse. We have gotten a
lot of helpful input to our project, and therefore we would like to thank our supervisor
Thomas K. Jørgensen and the rest of the hardware group for their help.
TABLE OF CONTENTS
Vol. 1
Abstract
Resumé
Preface
List of figures
List of tables
Abbreviations
1 Introduction
1.1 Problem background
1.2 Problem statement
1.3 Target audience
1.4 The report structure
2 Interfaces of the PCIe-Bridge
2.1 PCI Express (JH)
2.1.1 PCI Express components
2.1.2 Transaction Layer Packets
2.1.3 Interrupts
2.1.4 Virtual channels
2.2 Parallel interface (RBS)
2.2.1 Posted Reads
2.2.2 Interrupts
2.3 Serial interface (JH)
2.4 General purpose input / output pins (RBS)
2.5 Requirements and usage of interfaces (RBS)
3 Selecting an FPGA
3.1 FPGA requirements (JH)
3.1.1 PCI Express Intellectual Property
3.1.2 IO Pins
3.2 Overview of possible FPGAs (JH)
3.3 About the chosen FPGA (JH)
4 Design of the PCIe-Bridge
4.1 Structural design (RBS)
4.2 Components of the structural design
4.2.1 PCIe-module (JH)
4.2.2 TLP-Switch (RBS)
4.2.3 TLP-Multiplex (RBS)
4.2.4 PI-module (RBS)
4.2.5 PI-IRQ-module (RBS)
4.2.6 SI-module (JH)
4.2.7 Control-module (RBS)
4.2.8 Unsup-module (JH)
4.3 Address space of the PCIe-Bridge (JH)
4.4 Expected performance (RBS)
5 Implementation of the PCIe-Bridge
5.1 Implementation of HDL (RBS)
5.2 Design of interface board (JH)
5.2.1 Components
5.2.2 Schematics
5.2.3 Layout
6 Testing and verification
6.1 Simulation (RBS)
6.1.1 Structure of testbench
6.1.2 Methodology of testing
6.1.3 Verifying packet routing
6.1.4 Verifying functionality of modules
6.2 Synthesis
6.2.1 Hardware usage (JH)
6.2.2 Meeting the timing constraints (RBS)
7 Conclusion
7.1 Results
7.2 Perspectives
7.3 Further work
References
Vol. 2 (Confidential)
A Application note
B PCB schematics
C Test files
D Source code
E Datasheet extracts
F PCI Express extract
LIST OF FIGURES
Figure 1-1: External CPU interfacing to Vitesse switch chip, through parallel interface.
Figure 1-2: External CPU interfacing to Vitesse switch chip, through PCI Express interface.
Figure 1-3: External CPU interfacing to multiple Vitesse switch chips, through PCI Express interface.
Figure 2-1: Overview of the interfaces on the PCIe-Bridge.
Figure 2-2: The structure of the PCIe link, between the PCIe-Bridge and the External CPU, showing the layers of the PCI Express interface.
Figure 2-3: Serial order of Transaction Layer Packets.
Figure 2-4: Timing diagram of a read access through the parallel interface.
Figure 2-5: Timing diagram of a write access through the parallel interface.
Figure 2-6: Timing of accesses when doing a posted read from a 16-bit parallel interface.
Figure 2-7: The bandwidth of a 16-bit parallel interface on the Vitesse chips, as a function of posted reads.
Figure 2-8: Timing diagram of a write sequence to the Luton26 chip, through the serial interface.
Figure 2-9: Timing diagram of a read sequence using Luton26 serial interface.
Figure 4-1: Block diagram of the structural design of the PCIe-Bridge.
Figure 4-2: Timing diagram of Avalon-ST interface.
Figure 4-3: State diagram of the TLP-Switch.
Figure 4-4: Timing diagram of altered Avalon-ST interface used in the TLP-Multiplex.
Figure 4-5: Timing diagram of altered Avalon-ST, when multiple StartOfPacket signals are high.
Figure 4-6: State diagram of state machine in the TLP-Multiplex.
Figure 4-7: The maximum bandwidth of the shared-bus when connected to 3 16-bit parallel interfaces.
Figure 4-8: Flowchart for executing requests in the control state machine of the parallel interface.
Figure 4-9: Block diagram of the PI-module.
Figure 4-10: State diagram of the PI ctrl state machine.
Figure 4-11: State diagram of the PI interface state machine.
Figure 4-12: State diagram of the PI completion state machine.
Figure 4-13: Block diagram of the SI-module.
Figure 4-14: State diagram of SI FSM Input.
Figure 4-15: State diagram for SI FSM control.
Figure 4-16: State diagram of interface state machine in the SI-module.
Figure 4-17: State diagram of output state machine in the SI-module.
Figure 4-18: State machine of the Control-module.
Figure 4-19: State diagram for unsupported TLP module.
Figure 5-1: Structural block diagram of PCIe-Bridge for testing routing of packets.
Figure 5-2: Overview of components on the PCIe-Bridge interface board.
Figure 5-3: An overview of how the components will be placed on the PCB.
Figure 6-1: Block diagram of testbench.
Figure 6-2: StartOfPacket signals of TLP-Switch, while testing routing of packets.
Figure 6-3: Timing diagram of interrupt signals, when testing PI-IRQ-module.
Figure 6-4: Measured bandwidth of the implemented PI-module, as a function of posted reads, when all 3 PIs are 16 bits wide.
Figure 7-1: Block diagram Backplane switch.
LIST OF TABLES
Table 2-1: Possible Transaction Layer Packet requests and completions.
Table 2-2: Transaction Layer Packet header for Memory requests, 32-bit address.
Table 2-3: Fields and structure of a completion TLP header.
Table 2-4: Fields and structure of a message TLP header.
Table 2-5: The signals and their description of the parallel interface of the Vitesse switch chips.
Table 2-6: Width of the address and data signals, for the MPLS, Jaguar and Luton26, Vitesse switch chips.
Table 2-7: Description of the signals of the SPI bus.
Table 2-8: Description of required general purpose input/output pins for the PCIe-Bridge.
Table 2-9: Table of requirements of the PCIe-Bridge and its interfaces.
Table 2-10: Required number of interfaces in PCIe-Bridge.
Table 3-1: List of possible IP solutions.
Table 3-2: The number of IO pins used by the different interfaces.
Table 3-3: The number of IO pins needed in the FPGA, to implement the desired interfaces.
Table 3-4: Price of possible FPGA solutions.
Table 3-5: Attributes of chosen Cyclone IV GX FPGA.
Table 4-1: The number of output ports on the TLP-Switch and a description of which requests are sent to which output port.
Table 4-2: Priorities of input ports of the TLP-Multiplex.
Table 4-3: Maximum bandwidth for SI reads and writes.
Table 4-4: Address space of the Control-module.
Table 4-5: Address space of the PCIe-Bridge.
Table 5-1: Filenames of the implemented modules.
Table 5-2: The supply voltage for the PCIe-Bridge interface board, for each of the reference boards.
Table 5-3: Needed supply voltages by components.
Table 6-1: Testable functionality of modules in the PCIe-Bridge.
Table 6-2: Results of PI-IRQ-module test.
Table 6-3: Total hardware usage of the PCIe-Bridge design on FPGA.
Table 6-4: Expected memory bit and register usage of implemented FIFOs.
Table 6-5: Expected LC register usage for the PCIe-Bridge design.
ABBREVIATIONS
CPU Central Processing Unit
DW Double Word (32 bits)
FIFO First In First Out
FPGA Field Programmable Gate Array
GPIO General Purpose Input Output
HDL Hardware Description Language
IO Input Output
IP Intellectual Property
MPLS Multiprotocol Label Switching
PCB Printed Circuit Board
PCI Peripheral Component Interconnect
PCI-SIG PCI-Special Interest Group
PCIe PCI Express
PHY Physical layer
PI Parallel Interface
QW Quadruple Word (64 bits)
SI Serial Interface
SPI Serial Peripheral Interface
TLP Transaction Layer Packet
VHDL VHSIC Hardware Description Language
VHSIC Very High Speed Integrated Circuit
1 INTRODUCTION
1.1 Problem background
Vitesse Semiconductor Corporation makes Ethernet switch chipsets that its customers
integrate into their Ethernet switch products. Customers using these Vitesse chipsets
might require an external CPU system for performing management tasks, such as
collecting statistical data and configuring the switch. To meet this requirement, some
of the Vitesse switch chips have a parallel interface for external CPU communication,
as outlined in Figure 1-1.
Figure 1-1: External CPU interfacing to Vitesse switch chip, through parallel interface.
With the introduction of PCI Express, customers now request the ability to access
Vitesse switch chips from external CPU systems by taking advantage of this flexible and
widely supported standard. Adding a PCI Express interface module to the existing
Vitesse switch chips is not a feasible option, and hence building a “PCI Express to
Parallel Interface” bridge (PCIe-Bridge) becomes relevant. The PCIe-Bridge should not
limit the operation between the External CPU and the switch chip, with respect to
either functionality or speed. The price of the PCIe-Bridge comes in addition to the
price of the switch chip. Therefore, the component price of the PCIe-Bridge must be low
relative to the price of the switch chip. A solution is to use a low-cost FPGA for the
implementation of the PCIe-Bridge. This setup is outlined in Figure 1-2.
Figure 1-2: External CPU interfacing to Vitesse switch chip, through PCI Express interface.
Vitesse has two newly developed switch chips, the Luton26 and the Jaguar. The Jaguar
chip is a 24x1G + 4x10G Carrier Ethernet Switch; the Luton26 is a 26x1G Ethernet
Switch with 12 integrated 1000BASE-T PHYs. Both switch chips have a parallel
interface for onboard external communication. They also have a serial interface for
booting their internal CPU or for communication with peripheral components.
Two Jaguar switch chips can be combined into one switch with twice as many ports.
The PCIe-Bridge shall support multiple interfaces of these new types of switches.
Furthermore, the configurations with the Jaguar switch chips should support an
additional MPLS module; this configuration is shown in Figure 1-3. The parallel
interface on the MPLS module is very similar to the parallel interface on the Jaguar.
Figure 1-3: External CPU interfacing to multiple Vitesse switch chips, through PCI Express interface.
1.2 Problem statement
This project aims to design a PCIe-Bridge between PCI Express and the parallel
interfaces featured on the Vitesse Luton26 and Jaguar switches. The PCIe-Bridge should
be fast enough to ensure that the parallel interface, not the bridge, restricts the
speed of communication. In some configurations, there is a need to interface to
multiple parallel interfaces through the bridge. Furthermore, the bridge must be
implemented in a low-cost FPGA. Since we are implementing this PCIe-Bridge anyway, it
would be favorable to include some additional functionality. This additional
functionality could be a serial interface and a number of general-purpose input/output
pins for controlling the operation state of the switch chips.
A solution to the described problem will be found by answering the following questions:
- What are the requirements for the PCIe-Bridge?
- Which type of FPGA best meets the requirements of the PCIe-Bridge?
- What is a good design of the PCIe-Bridge?
- How can the PCIe-Bridge be implemented?
- How can the implementation of the PCIe-Bridge be tested and verified?
To state the requirements for the bridge, we will investigate the protocols of the
different interfaces mentioned in the problem statement. Based on the stated
requirements and the knowledge obtained from this investigation, we will find the
lowest-cost FPGA that meets the requirements. With the physical restrictions at hand,
we will examine how the functional design of the bridge can meet the requirements.
Knowing the design and the type of FPGA, we can then determine how to implement the
bridge in a real-life configuration. With the implementation and the design in mind,
the bridge can be tested and verified.
1.3 Target audience
The target audience for this report will be:
- The auditors of our Bachelor thesis.
- Developers at Vitesse who want to use and maintain our PCIe-Bridge design to connect an external CPU to a Vitesse switch chip through PCI Express.
- Other developers interested in designing FPGA systems connected through PCI Express to other digital components.
- Customers who want to integrate the PCIe-Bridge should read the application note.
1.4 The report structure
The main part of this report is divided into the following 5 chapters:
Interfaces of the PCIe-Bridge – Here we explain the basic theory behind the interfaces
used in this project, and how they can be combined. We also look at the different
requirements of the interfaces.
Selecting an FPGA – We list the arguments for the different types of FPGA to use in
this project and how they will affect the design of the PCIe-Bridge. On this basis we
choose which type of FPGA will suit this project the best.
Design of the PCIe-Bridge – In this chapter, we discuss the design considerations for
each of the modules in the PCIe-Bridge, and describe the design we are implementing.
Implementation of the PCIe-Bridge – We explain the process of implementing the
PCIe-Bridge.
Testing and verification – We look at how the PCIe-Bridge can be tested, and we seek
to verify that the implementation is working properly.
2 INTERFACES OF THE PCIE-BRIDGE
To understand the requirements of the PCIe-Bridge, it is necessary to understand the
different interfaces that the PCIe-Bridge will connect to. In this chapter we provide an
overview of the interface protocols used in the PCIe-Bridge. An overview of the
interfaces can be seen in Figure 2-1.
The interfaces are the following:
- PCI Express, for communication with an external CPU.
- Parallel Interface, for communication with the Vitesse switch chips.
- Serial Interface, for communication with the Vitesse switch chips.
- General Purpose Input/Output pins, for controlling the operation state of the Vitesse switch chips.
Figure 2-1: Overview of the interfaces on the PCIe-Bridge.
2.1 PCI Express (JH)
The Peripheral Component Interconnect Express bus, abbreviated PCIe, was introduced in 2004. It has replaced the older PCI bus and other internal chip interconnects. The PCIe protocol is structured around serial component-to-component Links. A PCIe Link consists of differentially driven signal pairs, divided into transmit pairs and receive pairs. One such set of signals, with one pair for transmit (Tx in Figure 2-2) and one pair for receive (Rx in Figure 2-2), is called a PCIe lane (Appendix F.1). A PCIe Link can consist of 1, 2, 4, 8, 16 or 32 lanes, to increase the bandwidth of the Link. For the first generation of PCIe technology, the raw signalling rate is 2.5 Gigabits/second/lane/direction. Because first generation PCIe uses 8b/10b encoding, the nominal bandwidth in each direction per lane is 250 MB/s. Taking overhead into account, 200 MB/s are usable by components for data transfer [PCI Express Architecture, 1].
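The 250 MB/s figure follows from the line rate and the 8b/10b encoding. The arithmetic can be sketched in Python as below; the function name and the scaling with the number of lanes are illustrative choices and are not taken from the thesis.

```python
def pcie_gen1_nominal_mb_per_s(lanes: int = 1) -> float:
    """Nominal first-generation PCIe bandwidth per direction, in MB/s."""
    line_rate_bit_per_s = 2_500_000_000        # raw signalling rate per lane
    # 8b/10b encoding: only 8 of every 10 transmitted bits carry data.
    data_bit_per_s = line_rate_bit_per_s * 8 / 10
    bytes_per_s = data_bit_per_s / 8
    return lanes * bytes_per_s / 1e6


print(pcie_gen1_nominal_mb_per_s(1))   # one lane
print(pcie_gen1_nominal_mb_per_s(4))   # four lanes
```

The usable 200 MB/s quoted above additionally accounts for protocol overhead, which this sketch does not model.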
The PCIe architecture consists of the three discrete logical layers: The Transaction
Layer, the Data-Link Layer and the Physical Layer (PHY).
Transaction Layer: The top-level layer. It is responsible for assembling and disassembling the packets sent or received over the PCIe Link.
Data-Link Layer: The middle layer. It functions as an intermediate stage between the Transaction Layer and the Physical Layer. This includes Link management and data integrity, with error detection and correction.
Physical Layer: The lowest-level layer. It contains the circuitry for interface operation: driver and input buffers, parallel-to-serial and serial-to-parallel conversion, one or multiple PLLs and impedance matching circuitry.
The communication over a PCIe Link, in this case between an external CPU and the PCIe-Bridge, can be seen in Figure 2-2.
[Figure: layer diagram of the External CPU system and the PCIe-Bridge, with the labels App. layer, Transaction layer, Data-link-layer, Physical layer and External CPU, and the Tx and Rx pairs between the two sides.]
Figure 2-2: The structure of the PCIe link, between the PCIe-Bridge and the External CPU, showing the layers of the PCI Express interface.
The three layers are split into two halves, one handling incoming data traffic and one
handling outgoing data traffic. On top of the Transaction Layer will be an application
layer, implemented in the PCIe-Bridge.
2.1.1 PCI Express components
Components communicating through a PCIe Link can function either as a root port or as an endpoint device. The root port device is connected to the root complex of the PCI Express hierarchy. The root port maps a portion of the PCI Express hierarchy to endpoints, PCI Express-PCI bridges or PCI Express fabric switches (Appendix F.1). A PCI Express endpoint can be a requester, a completer or both, either on its own account or on behalf of another non-PCIe component. The endpoint function is what is needed for the PCIe-Bridge. Therefore the root port will not be investigated further.
A PCIe endpoint is identified by the four values:
Vendor ID: An ID unique to each vendor, assigned by the PCI Special Interest Group (PCI-SIG). It is reserved for the manufacturer of the PCIe component.
Device ID: A unique ID given to the PCIe component by the vendor.
Revision ID: A value stating the revision of the device.
Class Code: A 24-bit value specifying what type of device it is.
When an endpoint is connected to a root port, the Base Address Registers (BARs) specify the address space needed by the specific PCIe component. The values are found in the configuration space registers of the PCIe endpoint [2].
2.1.2 Transaction Layer Packets
The PCI Express interface is packet based. The packets are called Transaction Layer Packets (TLPs). A TLP consists of a header, a data payload, if applicable, and an optional TLP digest. For more information on the TLP digest see Appendix F.3. The packets are sent serially over the PCIe Link, in the order shown in Figure 2-3.
[Figure: the parts of a TLP in serial order: TLP Header, Data Payload (if applicable), TLP digest (optional).]
Figure 2-3: Serial order of Transaction Layer Packets.
The Data Payload does not have a fixed size. The amount of data to be written or read is set in the header of the TLP. A request that writes or reads more than 32 bits is denoted a burst request. The PCIe-Bridge does not support TLP digests, and they will not be explored any further.
The Transaction Layer provides four address spaces: the three PCI address spaces, Memory, I/O and Configuration, and a fourth address space, the Message space [Transaction layer, 1]. Read and write requests are possible to each of the PCI address spaces. A list of possible TLPs is found in Table 2-1.
Table 2-1: Possible Transaction Layer Packet requests and completions.
Type of TLP                     Functional description
Memory read/write               Reads/writes data from/to memory
I/O read/write                  Reads/writes data from/to I/O
Configuration read/write        Reads/writes data from/to Config. Reg.
Message                         Sends message. Is also used for interrupt
Completion with/without data    Sends completion to a request
There are two main categories of TLPs: requests and completions. The Memory write request and the Message request are posted requests and do not require a completion; all other requests do. For a read request, the data read is returned as data payload in the completion TLP.
The format of the TLP header varies with the type of request. A header for a Memory request is seen in Table 2-2.
Table 2-2: Transaction Layer Packet header for Memory requests, 32-bit address.
Byte   +0                +1                +2                +3
       7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0
0      R | Fmt | Type | R | TC | R | TD | EP | Attr | R | Length
4      Requester ID | Tag | Last BE | First BE
8      Addr[31:2] | R
In this case the address has a length of 32 bits. Many newer CPU systems use 64-bit addressing. With 64-bit addressing, the content of bytes 8-11 in Table 2-2 is placed in bytes 12-15 instead, and bytes 8-11 of the header hold Addr[63:32].
The header of a request has a length of either 3 or 4 Double Words (DW), where a DW is 32 bits. The header consists of different fields. The R fields are reserved and set low. The other fields in Table 2-2 and their functions are listed below.
Addr: The address field contains the starting address of the request. The address is always DW aligned, because the two least significant bits are reserved.
Fmt: This field states whether the request is with or without data, and whether a 32-bit or a 64-bit address is used.
Type: This value determines the type of the request.
TC: The traffic class field. It sets which service class the request belongs to (Appendix F.3). It ultimately determines the relative priority of the PCI Express transaction [3].
TD: This bit is asserted if there is a TLP digest. For the PCIe-Bridge, this bit must be low. If a TLP with a TLP digest is received, the PCIe-Bridge will malfunction.
EP: The EP bit is set if the Transaction Layer has detected an error in the TLP. This is a non-fatal error [Error Handling, 4], and therefore this bit is not taken into account in the PCIe-Bridge.
Attr: Provides additional information that allows modification of the default handling of Transactions (Appendix F.3).
Length: The number of DWs to be read or written. This value is used in burst-read and burst-write requests.
Requester ID: Every device communicating through PCI Express is given a device ID. This device ID is static and is used as the Requester ID value.
Tag: The tag is given to each transaction so that Requester ID and Tag form a unique Transaction ID.
Last BE and First BE: Byte enables that qualify the bytes of interest in the first and last DW transferred. This allows offsetting the address from the DW boundaries, and also allows transfers smaller than one DW [3].
The Length field is 10 bits wide, which means that the payload of a TLP can be up to 1024 DW. The maximum payload can be restricted in a specific endpoint device.
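To make the field layout of Table 2-2 concrete, the following Python sketch packs and parses the 3 DW header of a 32-bit memory request. It is only an illustration: the function names are hypothetical, and the numeric Fmt and Type encodings are taken from the PCI Express base specification as we understand it, not from this report, so they should be checked against the specification before use.

```python
import struct

# Assumed encodings (PCI Express base specification, not stated in the thesis).
FMT_3DW_NO_DATA = 0b00     # 3 DW header without data, e.g. memory read
FMT_3DW_WITH_DATA = 0b10   # 3 DW header with data, e.g. memory write
TYPE_MEMORY = 0b00000


def pack_mem_header(is_write, length_dw, requester_id, tag,
                    first_be, last_be, addr, tc=0):
    """Build the 12-byte header of a 32-bit memory request (Table 2-2)."""
    if not 1 <= length_dw <= 1024:
        raise ValueError("length must be 1..1024 DW")
    if addr & 0x3 or not 0 <= addr < (1 << 32):
        raise ValueError("address must be a DW-aligned 32-bit value")
    fmt = FMT_3DW_WITH_DATA if is_write else FMT_3DW_NO_DATA
    # A Length value of 0 encodes 1024 DW, hence the 10-bit mask.
    dw0 = (fmt << 29) | (TYPE_MEMORY << 24) | (tc << 20) | (length_dw & 0x3FF)
    dw1 = (requester_id << 16) | (tag << 8) | (last_be << 4) | first_be
    dw2 = addr  # bits [1:0] are the reserved field and stay zero
    return struct.pack(">III", dw0, dw1, dw2)


def parse_mem_header(header):
    """Split a 12-byte memory request header into its fields."""
    dw0, dw1, dw2 = struct.unpack(">III", header[:12])
    return {
        "fmt": (dw0 >> 29) & 0x3,
        "type": (dw0 >> 24) & 0x1F,
        "tc": (dw0 >> 20) & 0x7,
        "td": (dw0 >> 15) & 0x1,
        "ep": (dw0 >> 14) & 0x1,
        "attr": (dw0 >> 12) & 0x3,
        "length_dw": (dw0 & 0x3FF) or 1024,
        "requester_id": dw1 >> 16,
        "tag": (dw1 >> 8) & 0xFF,
        "last_be": (dw1 >> 4) & 0xF,
        "first_be": dw1 & 0xF,
        "addr": dw2 & 0xFFFFFFFC,
    }


header = pack_mem_header(True, 1, 0x0100, 7, 0xF, 0x0, 0x1000)
print(parse_mem_header(header))
```

The reserved fields are simply left at zero, which matches the statement that the R fields are set low.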
PCI Express supports split transactions. As mentioned for the Tag, the Requester ID and Tag fields form a Transaction ID. This Transaction ID is reserved until the completion has been sent. This means that as long as there is a unique Transaction ID available, a request can be sent. Each PCIe component specifies how many Tags are supported, and thereby what the maximum number of outstanding requests is.
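The bookkeeping behind split transactions can be illustrated with a small tag pool. This is a hypothetical sketch in Python, not the mechanism used in the PCIe-Bridge or in any PCIe IP; the class and method names are invented for the example.

```python
class TagPool:
    """Tracks which Tags are reserved by outstanding requests."""

    def __init__(self, num_tags):
        self._free = list(range(num_tags))
        self._outstanding = {}   # tag -> request waiting for its completion

    def allocate(self, request):
        """Reserve a Tag for a request, or return None if none is free."""
        if not self._free:
            return None          # the requester has to wait
        tag = self._free.pop(0)
        self._outstanding[tag] = request
        return tag

    def complete(self, tag):
        """Release the Tag when the completion for it arrives."""
        request = self._outstanding.pop(tag)
        self._free.append(tag)
        return request


pool = TagPool(num_tags=2)
first = pool.allocate("read A")
second = pool.allocate("read B")
print(pool.allocate("read C"))   # None: both Tags are in use
print(pool.complete(first))      # the completion frees the Tag again
```

With two Tags, at most two requests can be outstanding, which is the limit described in the paragraph above.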
When a request requires a completion, the Transaction ID is included in the Completion TLP header to identify which request the completion is for, as seen in Table 2-3.
Table 2-3: Fields and structure of a completion TLP header.
Byte   +0                +1                +2                +3
       7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0
0      R | Fmt | Type | R | TC | R | TD | EP | Attr | R | Length
4      Completer ID | Completion status | BCM | Byte count
8      Requester ID | Tag | R | Lower address
A Completion header is always 3 DW, in contrast to the Memory request headers. A completion header contains some fields that were not present in the Memory request header. The new fields and their functions are described below.
Completer ID: A value composed of three separate fields: Bus number, Device number and Function number. It is unique for every PCI Express function and may change during runtime.
Completion status: A value stating the status of the request, i.e. whether the request was successful or not.
BCM: A field for the PCI-X standard that must not be set by PCI Express completers.
Byte count: If the completion to a read request is split into multiple completion
packets, this field shows how many bytes remain to be read (Appendix F.3). For a
memory write, the byte count is always set to 4.
Lower address: This field contains bits 2 to 8 from the Addr field of the request.
If there is any data to return, it will be sent directly after the header.
2.1.3 Interrupts
Interrupts over a PCIe Link can be handled in different ways. One way is the INTx virtual wire signaling mechanism, which uses message TLPs to signal an interrupt. The header of a message TLP is seen in Table 2-4.
Table 2-4: Fields and structure of a message TLP header.
Byte   +0                +1                +2                +3
       7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0
0      R | Fmt | Type | R | TC | R | TD | EP | Attr | R | Length
4      Requester ID | Tag | Message Code
8      R
12
The header looks much like that of a Memory request TLP. The Message Code field is
the only unique field and is used to indicate what kind of message the TLP is. The INTx
mechanism has eight distinct messages. The eight interrupt messages are split into two
categories: Assert and Deassert. The INTx implements four virtual wires for each end-
point, that each have an assert and deassert message associated.
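The eight INTx messages can be enumerated as four wires with an assert and a deassert code each. The sketch below, in Python, uses the Message Code values that the PCI Express base specification assigns to Assert_INTx and Deassert_INTx as we recall them (0x20 to 0x27); these values are not given in this report and should be verified against the specification. The function name is hypothetical.

```python
INTX_WIRES = ("A", "B", "C", "D")   # the four virtual wires INTA..INTD

ASSERT_BASE = 0x20     # assumed: Assert_INTA, with INTB..INTD following
DEASSERT_BASE = 0x24   # assumed: Deassert_INTA, with INTB..INTD following


def intx_message_code(wire, asserted):
    """Return the Message Code for asserting or deasserting one INTx wire."""
    index = INTX_WIRES.index(wire)   # raises ValueError for unknown wires
    return (ASSERT_BASE if asserted else DEASSERT_BASE) + index


for wire in INTX_WIRES:
    print(wire, hex(intx_message_code(wire, True)),
          hex(intx_message_code(wire, False)))
```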
2.1.4 Virtual channels
Virtual Channels (VC) can be used together with the traffic class (TC) field to route and prioritize packets when they are sent over the PCI Express fabric. A PCI Express Link may consist of multiple virtual channels. Devices may support up to eight virtual channels, and they are weighted when routed through the PCI Express fabric such that virtual channel number 0 (VC0) gets the best service and VC7 the least.
Traffic classes are mapped to virtual channels such that multiple TCs can be mapped to one VC, while one TC cannot be mapped to multiple VCs. VC0 and TC0 (TC = “000”) are always mapped together. The TC1 class can be mapped to either VC1 or VC0, TC2 can only be mapped to VC2, VC1 or VC0, and so on. For a detailed description of virtual channels, see Appendix F.4.
2.2 Parallel interface (RBS)
The parallel interface (PI) on the Vitesse switch chip can operate in master mode or in slave mode. In master mode, the switch chip can access peripheral components through the PI. In slave mode, an external device, e.g. a CPU, can access the internal registers in the switch chip. When the external CPU is connected to the PI of one or multiple Vitesse switch chips through the PCIe-Bridge, the PCIe-Bridge will always be master of the PI. The PI signals and timing diagrams of the Luton28 are found on pages 226-228 of Appendix E.1 and are repeated in Table 2-5, Figure 2-4 and Figure 2-5. The PI signals and timing diagrams of the Luton28 are similar to those of the Jaguar and the Luton26. The reference is to the Luton28 datasheet because the datasheets of the Jaguar and Luton26 are not finished.
Table 2-5: The signals and their description of the parallel interface of the Vitesse switch chips.
Signals      Type   Description
PI_Addr[ ]   I      Address of the register
PI_Data[ ]   I/O    The data to be written or read
PI_RnW       I      Read/write control pin
PI_nOE       I      Enables the data output of the PI
PI_nCS       I      Chip select, for selecting the specific chip
PI_nDone     O      Acknowledge control pin
PI_IRQ[ ]    O      Interrupt signal
The supported types of switch chips have different widths of the signals PI_Addr and
PI_Data. The width of these signals can be seen in Table 2-6 [5].
Table 2-6: Width of the address and data signals, for the MPLS, Jaguar and Luton26, Vitesse switch chips.
Chip type       Signal    Width [# Bits]
Jaguar / MPLS   PI_Addr   24
Jaguar / MPLS   PI_Data   16
Luton26         PI_Addr   4
Luton26         PI_Data   8
The parallel interface of the MPLS module is similar to the parallel interface of the Jaguar. The way of interfacing to the MPLS module and the Jaguar switch chip is identical; only the addresses of the registers vary.
The internal registers in the MPLS, Jaguar and Luton26 chips are 32 bits wide. To read a 32-bit register through the 16-bit data interface, 2 consecutive reads with incrementing addresses are required. Through the 8-bit data interface, 4 consecutive reads with incrementing addresses are required.
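The assembly of a 32-bit register value from several narrower PI reads can be sketched as follows in Python. Two points are assumptions made for the example and are not stated in the text: that the address increases by one per access, and the order in which the parts are combined (selected here by the hypothetical `low_part_first` parameter). The `read_bus` callback stands in for a single PI read access.

```python
def read_register32(read_bus, base_addr, data_width_bits, low_part_first=True):
    """Read one 32-bit register through an 8-bit or 16-bit data bus."""
    if data_width_bits not in (8, 16):
        raise ValueError("the PI data bus is 8 or 16 bits wide")
    accesses = 32 // data_width_bits          # 4 reads or 2 reads
    mask = (1 << data_width_bits) - 1
    value = 0
    for i in range(accesses):
        part = read_bus(base_addr + i) & mask  # assumed address step of 1
        if low_part_first:
            shift = i * data_width_bits
        else:
            shift = (accesses - 1 - i) * data_width_bits
        value |= part << shift
    return value


# A dictionary stands in for the chip in this example.
fake_chip = {0x10: 0xBEEF, 0x11: 0xDEAD}
print(hex(read_register32(fake_chip.__getitem__, 0x10, 16)))
```

Which part order the Vitesse chips actually use has to be taken from the datasheet in Appendix E.1.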
The timing diagram of a read through the PI is seen in Figure 2-4.
[Figure: timing diagram showing the signals PI_ADDR (Addr), PI_Data (Data), PI_nWR, PI_nOE, PI_nCS and PI_nDone.]
Figure 2-4: Timing diagram of a read access through the parallel interface.
The address and the output enable signals of the PI have a setup time of 4 ns. After the setup time the chip-select signal can be set low. When the data of the read access is valid, the nDone signal goes low and stays low until the chip-select goes high again.
The timing diagram of a write through the PI is seen in Figure 2-5.
[Figure: timing diagram showing the signals PI_ADDR (Addr), PI_Data (Data), PI_nWR, PI_nOE, PI_nCS and PI_nDone.]
Figure 2-5: Timing diagram of a write access through the parallel interface.
The address and data signals of the PI have a setup time of 4 ns. After the setup time the chip-select signal can be set low. The nDone signal goes low when the data is written. For the precise timing specifications, see Appendix E.2.
2.2.1 Posted Reads
The internal registers of the two switch chips and the MPLS co-processor consist of normal registers and fast registers, which have different access times. Because the datasheets of the 3 chips are not finished yet, we do not have the addresses of the different registers. The access time of a normal register is 470 ns, and the access time of a fast register is up to 65 ns. This difference in access time has no effect when registers are written to; all writes take up to 65 ns. To increase the bandwidth when reading from normal registers, the execution of such a read can be split into three operations, a so-called posted read. The timing of the posted read can be seen in Figure 2-6.
The three operations of a posted read are:
A single read from the normal register.
A wait of at least 470 ns.
A number of consecutive reads from the fast register.
A posted read is initiated by reading from the address of the normal register. The data in the normal register is then transferred to a predefined fast register, in the Vitesse documentation referred to as the SLOWDATA register. After 470 ns the fast register can be read like any other fast register (Appendix E.1). There can only be one posted read in progress at a time in each switch chip.
[Figure: timing of a posted read: a 65 ns read from Addr, a 470 ns wait in which other registers can be accessed, and a 2*65 ns = 130 ns read from SLOWDATA.]
Figure 2-6: Timing of accesses when doing a posted read from a 16-bit parallel interface.
With the ability to split a read from a normal register into two parts, we get the opportunity to execute reads from fast registers and writes during the access time of a normal register. This increases the maximum bandwidth, depending on the proportion of reads from normal registers. The maximum bandwidth of the PI, with and without the use of posted reads, can be seen in Figure 2-7. The maximum bandwidth is calculated by dividing the number of bytes by the time it takes to execute the respective proportion of posted reads and non-posted reads, when the non-posted reads are executed in the waiting time as far as possible.
Figure 2-7: The bandwidth of a 16-bit parallel interface on the Vitesse chips, as a func-
tion of posted reads.
The posted reads graph in Figure 2-7 shows an increase of bandwidth between 0 % and 100 % posted reads. The increase of bandwidth is almost 100 % at 20 % posted reads. It is favorable for the PCIe-Bridge to be able to execute posted reads, because this increases the bandwidth to some extent no matter the proportion between posted reads and non-posted reads.
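The kind of calculation described for the bandwidth graph can be sketched with a simplified model in Python. It is not the calculation behind Figure 2-7. The model assumes a 16-bit PI, so that a 32-bit register takes two accesses, that a non-posted read of a normal register costs two accesses of 470 ns, that a posted read costs one 65 ns trigger access plus two 65 ns reads from SLOWDATA, and that the 470 ns wait is filled with fast accesses whenever enough of them are available. All names are hypothetical.

```python
FAST_NS = 65                 # access time of a fast register or a write
NORMAL_NS = 470              # access time of a normal register
ACCESSES_PER_REGISTER = 2    # 32-bit register over a 16-bit data bus
BYTES_PER_REGISTER = 4


def time_per_register_ns(normal_fraction, posted):
    """Average bus time per 32-bit register access.

    normal_fraction is the share of accesses that read a normal register;
    the remaining accesses are fast reads or writes.
    """
    if not 0.0 <= normal_fraction <= 1.0:
        raise ValueError("normal_fraction must be between 0 and 1")
    p = normal_fraction
    fast_time = ACCESSES_PER_REGISTER * FAST_NS
    if not posted:
        normal_time = ACCESSES_PER_REGISTER * NORMAL_NS   # assumption
        return p * normal_time + (1 - p) * fast_time
    if p == 0:
        return fast_time
    # Trigger read plus the reads from SLOWDATA.
    posted_bus_time = FAST_NS + ACCESSES_PER_REGISTER * FAST_NS
    # Fast accesses available to fill the wait of each posted read.
    filler_per_posted = (1 - p) / p * fast_time
    idle = max(0.0, NORMAL_NS - filler_per_posted)
    return p * (posted_bus_time + idle) + (1 - p) * fast_time


def bandwidth_mb_per_s(normal_fraction, posted):
    """Bandwidth in MB/s (1 byte/ns equals 1000 MB/s)."""
    return BYTES_PER_REGISTER / time_per_register_ns(normal_fraction, posted) * 1e3


for share in (0.0, 0.2, 0.5, 1.0):
    print(share, bandwidth_mb_per_s(share, False), bandwidth_mb_per_s(share, True))
```

In this model the posted variant is never slower than the non-posted one, which agrees qualitatively with the conclusion drawn from the graph.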
2.2.2 Interrupts
As a part of the parallel interface there are 2 interrupt pins. These interrupt pins can be programmed to act on different properties of the switch chip. We will not go into details with these properties, because they are programmed in software. The properties are programmed in registers on the switch chips or the MPLS co-processor.
2.3 Serial interface (JH)
The Vitesse switch chips can be accessed through a serial interface. The serial interface transfers data from one unit to another one bit at a time. The serial interface on the Vitesse switch chip, also referred to as the SI, is a variation of the Serial Peripheral Interface (SPI) bus. The SPI bus is an industry standard, and a component using the SPI bus can operate either in master or in slave mode [6]. Because the external CPU system accesses the Vitesse switch chip and not the other way around, the PCIe-Bridge will act as master and the switch chip as slave.
The SPI interface consists of two control signals and two data signals. The two control signals are the serial clock and the chip select, and the data signals are data in and data out. A complete signal overview for the SI on the Vitesse switch chip can be seen in Table 2-7.
Table 2-7: Description of the signals of the SPI bus.
Signal    Type   Description
SI_CLK    I      Serial clock. This has a maximum frequency of 25 MHz
SI_DI     I      Serial data in line, for receiving data
SI_DO     O      Serial data out line, for transmitting data
SI_nCS    I      Chip select line, indicating when the chip is being addressed
The serial clock has a maximum allowed frequency of 25 MHz (Appendix E.2). If the frequency is higher, there is a risk of data corruption. While the SI is idle, the serial clock is held low, and it goes high when the first bit is ready to be sent. The chip select is an active-low signal that indicates whether the chip is being addressed. It is held high when idle, and when a request is initiated the chip select is pulled low.
The SI on the Luton26 and Jaguar Vitesse chips can handle read and write requests. A write request for the Luton26 is executed as seen in Figure 2-8.
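[Figure: timing diagram showing SI_CLK, SI_nCS, SI_DI and SI_DO. The transferred bits are WR, Addr22, Addr21 ... Addr1, Addr0, Data31, Data30 ... Data1, Data0; a minimum time of 40 ns is marked.]
Figure 2-8: Timing diagram of a write sequence to the Luton26 chip, through the serial interface.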
The write sequence starts with a read/write bit, followed by a 23-bit address, followed directly by 32 bits of data [7]. The bits are sent on the rising edge of the serial clock.
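The bit sequence of such a write can be sketched in Python as below. The most-significant-bit-first order follows the labels Addr22 ... Addr0 and Data31 ... Data0 in Figure 2-8. The polarity of the read/write bit is not stated in the text, so the value 1 for a write is an assumption, as are the function and parameter names.

```python
def si_write_frame(address, data, addr_bits=23, write_bit=1):
    """Return the bits of one SI write, in the order they are shifted out.

    write_bit=1 for a write is an assumption; check the datasheet.
    """
    if not 0 <= address < (1 << addr_bits):
        raise ValueError("address does not fit in the address field")
    if not 0 <= data < (1 << 32):
        raise ValueError("data must be a 32-bit value")
    bits = [write_bit]
    # Address, most significant bit first (Addr22 ... Addr0).
    bits += [(address >> i) & 1 for i in range(addr_bits - 1, -1, -1)]
    # Data, most significant bit first (Data31 ... Data0).
    bits += [(data >> i) & 1 for i in range(31, -1, -1)]
    return bits


frame = si_write_frame(address=0x000001, data=0x80000001)
print(len(frame))   # 1 + 23 + 32 bits
print(frame)
```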
The read sequence contains a waiting interval in addition to sending the address and retrieving the data. The access time of the registers is at most 1 µs, and therefore the SI has to wait for that period before it can be certain that the accessed data is available. There are different methods of handling this wait interval. The one used in the PCIe-Bridge is to hold SI_CLK high for at least 1 µs. A read request can be seen in Figure 2-9.
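[Figure: timing diagram showing SI_CLK, SI_nCS, SI_DI and SI_DO. The bits WR, Addr22, Addr21 ... Addr1, Addr0 are followed by a wait of min. 1 µs and the bits Data31, Data30 ... Data1, Data0; a minimum time of 40 ns is also marked.]
Figure 2-9: Timing diagram of a read sequence using Luton26 serial interface.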
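A lower bound on the duration of an SI transaction follows from the bit counts and the maximum clock frequency. The Python sketch below counts one clock period per transferred bit plus the 1 µs wait for reads. It ignores the chip-select timing and the 40 ns minimum marked in the figures, so the real duration is somewhat longer; the function name is hypothetical.

```python
def si_min_duration_ns(is_read, clock_hz=25_000_000, addr_bits=23,
                       data_bits=32, read_wait_ns=1000):
    """Lower bound on the duration of one SI read or write, in ns."""
    period_ns = 1_000_000_000 // clock_hz      # 40 ns at 25 MHz
    bit_count = 1 + addr_bits + data_bits      # R/W bit, address, data
    duration = bit_count * period_ns
    if is_read:
        duration += read_wait_ns               # wait for the register access
    return duration


print(si_min_duration_ns(is_read=False))
print(si_min_duration_ns(is_read=True))
```

Since each transaction carries 4 bytes of data, dividing 4 bytes by these durations gives an upper bound on the SI throughput.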
The variation from the SPI bus lies in how a request is executed. The SPI bus supports data being transferred in both directions at the same time. In the Vitesse switch chips, however, data is either transmitted or received at any given time. Also, for the SPI interface there is no standard length of address or data. In addition, the read/write bit does not have to be the first bit transmitted.
The version found on the Jaguar chips also has a word length of 32 bits, but it has an address length of 22 bits instead of the 23 bits of the Luton26 chip. The read and write requests are nevertheless executed in the same way in the Jaguar chip as in the Luton26 chip, with bit 22 of the address being a “do not care” bit.
Because the SPI interface is an industry standard, it is supported by many components,
other than the Vitesse switch chips. Its presence therefore brings many additional possi-
bilities to the PCIe-Bridge. The SPI interface will be connected to two Vitesse switch
chips, the MPLS module and an additional Flash memory. Therefore, four serial inter-
faces are required on the PCIe-Bridge.
2.4 General purpose input / output pins (RBS)
In this application, the PCIe-Bridge is master of the system and the switch chips are
slaves. The master (PCIe-Bridge) needs three general-purpose I/O pins to take control
over the slaves (switch chips). These pins can be used for many different purposes, but
in this setup, they will be used as described in Table 2-8. The I/O pins have to be con-
trollable from the external CPU.
Table 2-8: Description of required general purpose input/output pins for the PCIe-Bridge.
Pin name                  Description
nRESET                    Resetting the switch chips
VCore_cfg1 & VCore_cfg0   Configuration of the operation mode of the switch chips
2.5 Requirements and usage of interfaces (RBS)
Based on the previous sections, we draw up a table of requirements for the different interfaces and for the basic functionality of the PCIe-Bridge.
Table 2-9: Table of requirements of the PCIe-Bridge and its interfaces.
Requirements
The PCIe-Bridge must:
  Implement all three layers of the PCI Express interface as a PCI Express endpoint.
  Be capable of interpreting Transaction layer packets from the PCI Express interface and return completions.
The PCI Express interface must:
  Be able to handle memory requests and return an unsupported request completion if another type of request is received.
  Have capability for returning completions to all supported requests.
  Implement the virtual wires functionality of the PCI Express base specification, for interrupt handling, this should be done by sending message requests.
The Parallel Interface must:
  Handle reads and writes.
  Communicate with three different chips.
  Be able to handle posted reads.
  Register events on the different interrupt pins.
The Serial Interface must:
  Handle reads and writes.
  Communicate with four different chips.
The GPIOs must:
  Be controllable by sending memory requests to the PCIe-Bridge.
The required number of interfaces and their required capabilities in the PCIe-Bridge are shown in Table 2-10. The data per transaction is a range for the PCI Express interface and the parallel interface. The reads and writes of the PCI Express interface can be burst reads and writes. The parallel interface has different data widths.
Table 2-10: Required number of interfaces in PCIe-Bridge.
Interface     Count   Supported transaction through the interface   Data per transaction
PCI Express   1       Memory Read / Write, Message interrupt        1 DW to 32 DW
PI            3       Read / Write                                  1 byte to 2 bytes
SI            4       Read / Write                                  1 DW
GPIO          3       Assert / Deassert                             1 bit
Because the amount of data per transaction differs between the interfaces, the execution of a transaction takes different amounts of time. This difference in execution time has to be dealt with in the design of the PCIe-Bridge.
3 SELECTING AN FPGA
The PCIe-Bridge will be implemented using an FPGA. In order to find a suitable FPGA for this purpose, a rough overview of what is required of the FPGA is necessary. When the requirement specification for the FPGA has been established, an overview of possible solutions from different FPGA vendors will be given, in order to find the best solution with respect to functionality and price.
3.1 FPGA requirements (JH)
The chosen FPGA must be able to connect to the different interfaces described in the previous chapter. Therefore, the specific requirements of each interface protocol must be met. For the general purpose input/outputs, the parallel interface and the serial interface, this means that a number of user-controllable input/output pins on the FPGA must be free for the interfaces to use. For the PCIe interface, though, there are special hardware requirements for the Physical Layer of the PCIe protocol stack.
3.1.1 PCI Express Intellectual Property
The three layers of the PCIe interface should be implemented in the FPGA using an Intellectual Property (IP) core, because implementing a PCIe endpoint from scratch would be a project in itself. The PCIe IP featured on an FPGA contains all three layers of the PCIe interface, and the developer therefore only has to concentrate on what is going on in the Transaction Layer.
These PCIe IPs for FPGAs come in two categories: Hard IP and Soft IP. There are three variations of PCIe IP solutions available, which can be seen in Table 3-1.
Table 3-1: List of possible IP solutions.
Possible IP solutions       Description
Hard IP                     A hard-wired circuit implementing a PCI Express endpoint is included in the FPGA.
Soft IP with internal PHY   A hard-wired circuit implementing the transceivers and the rest of the Physical Layer is included in the FPGA. The Data-Link Layer and the Transaction Layer are implemented in the programmable FPGA logic.
Soft IP with external PHY   The Data-Link Layer and the Transaction Layer are implemented in FPGA logic and the Physical Layer is placed on an external chip.
The external PHY chip makes it possible to use the cheapest series of FPGAs, because there are no special requirements for the FPGA other than having enough logic elements to contain the Soft IP.
3.1.2 IO Pins
The parallel interface, the serial interface and the general purpose IOs featured in the
PCIe-Bridge need a number of controllable IOs on the FPGA. How many each interface
needs is seen in Table 3-2.
Table 3-2: The number of IO pins used by the different interfaces.
Interface            # of IO pins needed
Parallel interface   46
Serial interface     4
General purpose IO   1
From Table 2-10 we know the number of required interfaces. The three parallel inter-
faces can each have a PI bus or they can share a single bus. The four serial interfaces
can have a SPI bus each or they can share a single bus. The number of IO pins needed
on the FPGA for these two solutions is seen in Table 3-3.
Table 3-3: The number of IO pins needed in the FPGA, to implement the desired interfaces.
Interface            Count   No. of IO pins (Full parallel)   No. of IO pins (Different chip select)
Parallel interface   3       144                              50
Serial interface     4       16                               7
General purpose IO   3       3                                3
Total pins                   163                              60
The aim is for the FPGA to have at least 60 IO pins available for the interfaces. It would be an advantage if it had the 163 IOs needed to support a separate bus for each interface.
3.2 Overview of possible FPGAs (JH)
There are three main vendors that deliver low-cost FPGAs with PCI Express IPs: Xilinx, Lattice and Altera. The unit prices for the Altera FPGAs and the Xilinx Spartan 3 are from Digi-Key. The unit price of the Xilinx Spartan 6 is from Avnet, and the price of the Lattice FPGA is from Lattice’s own webpage.
The low-cost FPGA solutions and their price can be seen in Table 3-4. These FPGAs
have been chosen so they meet the minimum requirements of needed IOs. Furthermore,
for the FPGAs that need a soft IP, there should be enough leftover logic for the imple-
mentation of the PCIe-Bridge.
Table 3-4: Price of possible FPGA solutions.
Vendor    FPGA family   Model number         Unit price [$]   External PHY cost [$]   Total cost [$]
Xilinx    Spartan 3     XC3S50A-4FTG256C     8.8              15.4                    24.2
Xilinx    Spartan 6     XC6SLX25T-2FG484C    64.3             0.0                     64.3
Lattice   ECP2M         LFE2-12E-5QN208C     28.8             0.0                     28.8
Altera    Cyclone II    EP2C5Q208C8N         13.9             15.4                    29.3
Altera    Cyclone IV    EP4CGX15BN11C8N      24               0.0                     25.2
The price of interest is the high volume price of the components. It is hard to say what
the high volume prices of the FPGAs are, so therefore we use the unit price as a refer-
ence, when choosing an FPGA.
The cheapest solution in one unit is the Altera Cyclone IV FPGA. Because the Cyclone
IV appears to be the cheapest solution, it will be the FPGA of choice. Furthermore, the
Cyclone IV series is a new family of FPGAs from Altera released in 2010, and therefore
the prices are likely to fall over time.
3.3 About the chosen FPGA (JH)
The Cyclone IV is a family of different FPGAs. Cyclone IV comes in two series, the E-
series and the GX-series. It is the GX-series that comes with the targeted PCI Express
Hard IP. Therefore the chosen FPGA is of the Cyclone IV GX series. The attributes for
the lowest cost FPGA in the Cyclone IV GX series can be seen in Table 3-5
Table 3-5: Attributes of chosen Cyclone IV GX FPGA
FPGA chosen         Product no.   PCIe IP   No. of IOs   Logical Elements
Altera Cyclone IV   EP4CGX15      Hard IP   72           14400
These attributes fulfill the requirements for supporting all the necessary interfaces when
using chip select signals to components. The Hard IP implements an x1 lane PCI Ex-
press connection which has a bandwidth of 200 MB/s.
The Altera PCI Express module, which is instantiated in the Quartus II design tool, can be instantiated in two different modes, with two different interfaces:
Avalon-MM: This interface executes one request at a time, through a data and address based interface.
Avalon-ST: This interface has two streaming interfaces, a receiving interface and a transmitting interface. Multiple requests can be executed at a time.
The split transaction functionality of the PCI Express specification is only used by the Avalon-ST interface [Avalon-ST interface, 7].
4 DESIGN OF THE PCIE-BRIDGE
The PCIe-Bridge will be designed in this section, such that it meets the requirements
found in section 2.5. The design will seek to increase the throughput to a degree where
it is the parallel interface and the serial interface that are the restricting factors.
4.1 Structural design (RBS)
The structure of the PCIe-Bridge design depends on which interface type, described in section 3.3, is chosen for the PCIe-module. The options are Avalon-MM and Avalon-ST. The PCIe-Bridge is required to be able to do posted reads through the PI and to utilize more than one of the different interfaces at a time. This means that the interface of the PCIe-module must support the split transaction feature of the PCI Express interface. The Avalon-ST is the only one of the Avalon interfaces that supports split transactions through the PCI Express interface.
The structural design of the PCIe-Bridge must accommodate the given requirements.
The main requirement is full utilization of the PI bandwidth. This suggests that there
must be a separate control unit for each interface, and the possibility of prioritizing the
completions of the different interfaces. These considerations yield a block diagram as
seen in Figure 4-1. All the blocks in the block diagram contain a control unit, for con-
trolling the execution of incoming requests.
[Figure: block diagram of the PCI Express Bridge (PCIe-Bridge). Labeled blocks: PCI Express module (PCIe-module), Hard IP, Avalon-ST implementation; The TLP switch (TLP-Switch); The TLP multiplexer (TLP-Multiplex); SI Requests (SI-module); PI Requests (PI-module); I/O and Control Requests (Control-module); Unsupported Requests (Unsup-module); PI Interrupt Requests (PI-IRQ-module).]
Figure 4-1: Block diagram of the structural design of the PCIe-Bridge.
The TLP-Switch routes the incoming Transaction Layer Packet requests to their intended
module, and the TLP-Multiplex forwards the outgoing completions, in prioritized order,
to the PCIe-module. To make the components easier to test, the packets remain in the
TLP format until they are processed in the modules. All the components therefore have
the same packet-based interfaces and can be tested individually.
Inside the modules, FIFOs store the packets in TLP format until the packets are
processed. The size of these FIFOs is important for keeping the bandwidth of the
PCIe-Bridge high. If a FIFO is too small, packets routed to the relevant module will
remain in the TLP-Switch and block packets destined for the other modules until there
is room in the relevant FIFO again. A single packet can block the TLP-Switch if the
FIFO in the relevant control unit is smaller than the packet itself. The FIFO should
therefore be larger than the largest possible TLP. The largest possible TLP is 19 QWs:
two QWs for the header and 17 for the data. The maximum data payload in this
application is 32 DWs, and if the address of the request is not QW aligned, the 32 DWs
span 17 QWs. The FIFO in the SI-module is the most critical with regard to size,
because the SI is the slowest interface and thereby has the lowest throughput.
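The sizing argument above can be checked with a short calculation. The following is a minimal sketch in Python, not part of the design itself; the function and constant names are chosen here for illustration only.

```python
import math

DWS_PER_QW = 2        # one 64-bit QW holds two 32-bit DWs
HEADER_QWS = 2        # header size in QWs, as stated in the text
MAX_PAYLOAD_DWS = 32  # maximum data payload in this application


def data_qws(payload_dws: int, qw_aligned: bool) -> int:
    """Number of QWs spanned by a payload of payload_dws DWs.

    A payload that does not start on a QW boundary begins in the upper
    half of a QW, which is modelled as an offset of one DW.
    """
    offset = 0 if qw_aligned else 1
    return math.ceil((offset + payload_dws) / DWS_PER_QW)


aligned_qws = data_qws(MAX_PAYLOAD_DWS, qw_aligned=True)
unaligned_qws = data_qws(MAX_PAYLOAD_DWS, qw_aligned=False)
largest_tlp_qws = HEADER_QWS + unaligned_qws

print(aligned_qws, unaligned_qws, largest_tlp_qws)
```

With these assumptions the aligned payload spans 16 QWs, the unaligned payload 17 QWs, and the largest TLP 19 QWs, in agreement with the figures above.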
If multiple requests are waiting to enter the SI-module, and thereby blocking the
TLP-Switch, the bandwidth of the bridge will drop to the level of the SI bandwidth for
as long as it takes to process the packets waiting to enter the SI-module.
4.2 Components of the structural design
The design process of each component depicted in Figure 4-1 will be described in this
section.
4.2.1 PCIe-module (JH)
The PCIe-module implements the PCI Express Link to the external CPU. This means
implementing the three layers of the PCI Express protocol. As stated in section 3.1.1,
this is done using the Hard IP on the chosen FPGA, instantiated with an Avalon-ST
interface. The Hard IP block is instantiated and configured using the MegaWizard
plug-in manager. The module is configured as an endpoint with a 27-bit address space.
In addition, the registers in the configuration space, mentioned in section 2.1.1, will
be set. Configuration requests to the PCIe-Bridge will then be handled by the
PCIe-module.
The Hard IP on the chosen FPGA only supports maximum TLP data payloads of 128 or
256 bytes. For the PCIe-Bridge, the maximum payload will be set to 128 bytes.
Acquiring data from the Avalon-ST module happens through a 64-bit data bus. The pro-
cedure of sending one packet can be seen in Figure 4-2.
[Timing diagram of the Avalon-ST interface. Signals: CLK, Ready, Valid, StartOfPacket,
EndOfPacket, Data (Qword 1, Qword 2, Qword 3).]
Figure 4-2: Timing diagram of Avalon-ST interface.
The order of the header and data DWs in the Avalon-ST interface can be seen in
[Avalon-ST interface, 4].
The First and Last byte enable fields mentioned in section 2.1.2 will not be used in the
PCIe-Bridge design. This is because the registers read or written through the parallel
and the serial interface are all 32-bit registers. The addresses of the parallel and serial
interface are DW aligned. The byte offset possibility, that comes with the first and last
byte enable, will therefore not be used.
4.2.2 TLP-Switch (RBS)
The TLP-Switch uses address-based routing to route the TLPs from the PCIe-module to
their intended module. For a detailed description of the address space of the
PCIe-Bridge, see section 4.3. When interfacing through the 64-bit Avalon-ST interface,
the address field is in the 2nd QW of the TLP. Therefore, the 1st QW of the TLP must be
saved until the address from the 2nd QW is found, and then the 1st QW must be sent
before the 2nd QW. For further information on the alignment of data from the 64-bit
Avalon-ST interface, see [Avalon-ST interface, 4]. After the 2nd QW is sent through the
switch, the potential data payload is sent. A state diagram of the switch state machine
can be seen in Figure 4-3.
[State diagram. States: Init, S0, S0_wait, S1, S2, S2_wait, S2_write. Transitions are
conditioned on the signals Empty, Sop, Eop, Prev_eop, Ready and Ready_out.]
Figure 4-3: State diagram of the TLP-Switch.
A short description of what happens in each state of the state machine in Figure 4-3:
Init – Waiting for input from the PCIe-module.
S0 – The first QW of the packet is saved.
S0_wait – The state machine goes to this state if there is a pause in the data stream.
S1 – The second QW of the packet is saved, and it is decided where to route
the packet. The first QW of the packet is sent.
S2 – The second QW of the packet is sent.
S2_wait – The state machine goes to this state if there is a pause in the data stream.
S2_write – Another QW of the packet is sent.
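The buffering of the 1st QW described above can be illustrated with a small behavioral model. This is a sketch in Python rather than HDL, and it ignores the Ready/Empty handshaking handled by the wait states. The names `route_packet` and `example_select_port` are hypothetical, and the bit position used to pick the port is an assumption made only for the example.

```python
def route_packet(qws, select_port):
    """Yield (port, qw) pairs for one TLP given as a sequence of QWs.

    The 1st QW is held back until the 2nd QW, which carries the address,
    has been seen. Only then is the output port known. A TLP always has
    at least two header QWs, so two reads from the stream are expected
    to succeed.
    """
    stream = iter(qws)
    first_qw = next(stream)                  # S0: save the 1st QW
    second_qw = next(stream)                 # S1: save the 2nd QW
    port = select_port(first_qw, second_qw)  # S1: decide the route
    yield port, first_qw                     # S1: send the 1st QW
    yield port, second_qw                    # S2: send the 2nd QW
    for qw in stream:                        # S2_write: send the payload
        yield port, qw


def example_select_port(first_qw, second_qw):
    # Assumption for illustration: address bits [26:24] sit in the low
    # 32 bits of the 2nd QW and directly give a port index.
    return (second_qw >> 24) & 0b111


packet = [0xAAAA, 0x02000010, 0x1, 0x2]
routed = list(route_packet(packet, example_select_port))
print(routed)
```

All QWs of the packet leave on the same port and in their original order, even though the routing decision is only available after the 2nd QW.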
Besides routing the packets to their address, the TLP-Switch should catch unsupported
TLP types and send them to the Unsup-module. The type field of the TLP can be found
in the 1st QW of the header. The PCIe-Bridge only supports memory reads and writes,
on the incoming port. The number of output ports on the switch module and their de-
scription can be seen in Table 4-1. All the output ports use the 64-bit Avalon-ST inter-
face.
Table 4-1: The number of output ports on the TLP-Switch and a description of which
requests are sent to which output port.
Number  Name          Description
1       PI Queue 1    Normal register reads from chip 1
2       PI Queue 2    Normal register reads from chip 2
3       PI Queue 3    Normal register reads from chip 3
4       PI Queue 4    Fast reads/writes and normal writes from all chips
5       SI            Requests through the SI
6       Control       Control requests
7       Extra output  Output for future use
8       Unsupported   Unsupported requests: I/O and Config. requests
Sending the requests to different output ports with different throughput will cause
the completions of the requests to be transmitted in a different order than the
requests were received. This is not a problem, because the PCI Express interface
supports split transactions.
To decide whether a packet should be routed to output port 4 rather than to one of the
3 preceding output ports, the switch must know the addresses of the fast registers.
The addresses of the fast registers have not been written into the code, because they
are not yet available in the datasheets. The addresses used instead are those of the
Luton28 switch chip. The Luton28 is an older Vitesse switch chip with a similar
parallel interface. The Luton28 addresses should be changed to the valid addresses
when they become available.
A limitation of the current version of the TLP-Switch is that there is no way of
bypassing a packet that blocks the TLP-Switch. The only actions that can be taken are
to wait for the blocking packet to go away or to reset the TLP-Switch by powering
down. This problem can be prevented by implementing a counter that, after a certain
amount of time, discards the packet or routes it to the Unsup-module. This feature has
not been implemented because of lack of time.
4.2.3 TLP-Multiplex (RBS)
The TLP-Multiplex directs the completion packets from the modules back to the
PCIe-module. At this point in the design a new problem shows up: the TLP-Multiplex
works like a funnel, where the data from six 64-bit Avalon-ST interfaces must fit into
a single 64-bit Avalon-ST interface. This creates the need to prioritize the packets
from the different modules. The priority of a module should depend on the usage of the
module. Packets from the PI-IRQ-module have the highest priority, to minimize the
latency of the interrupt message. The PI-module is the most frequently used interface
under normal conditions and should therefore have a high priority. The Unsup-module
should naturally have the lowest priority.
Because the PCI Express endpoint implementation in the chosen FPGA only supports
one virtual channel, we must settle with prioritizing the incoming packets instead of
mapping them to different virtual channels. The prioritizing function is made with a
priority encoder, choosing between the 6 Avalon-ST input ports of the TLP-Multiplex.
To distinguish between the Avalon-ST input ports that have packets for transmission
and the Avalon-ST input ports that do not have packets for transmission, we have made
a small alteration to the Avalon-ST interface on the TLP-Multiplex.
An Avalon-ST input port requests permission to send a packet through the TLP-Multiplex
by asserting the StartOfPacket signal while the ready signal is held low by the
TLP-Multiplex. When the ready signal is asserted by the TLP-Multiplex, the rest of the
procedure of the altered Avalon-ST interface is the same as for the standard Avalon-ST
interface. The timing diagram of the altered Avalon-ST interface is shown in Figure
4-4. Compare this with Figure 4-2 to see the effect of the alteration.
[Timing diagram of the altered Avalon-ST interface. Signals: CLK, Ready, Valid,
StartOfPacket, EndOfPacket, Data (Qword 1, Qword 2, Qword 3).]
Figure 4-4: Timing diagram of altered Avalon-ST interface used in the TLP-Multiplex.
In case multiple StartOfPacket signals go high at the same time, the priority encoder
picks the one with the highest priority and directs it through to the PCIe-module.
After the chosen packet is through, the ready signal goes low again. The timing diagram
when multiple StartOfPacket signals go high at the same time is seen in Figure 4-5.
[Timing diagram of the altered Avalon-ST interface. Signals: CLK, SOP_1, EOP_1,
Ready_1, SOP_2, EOP_2, Ready_2.]
Figure 4-5: Timing diagram of altered Avalon-ST, when multiple StartOfPacket signals
are high.
Here, SOP_1 and SOP_2 are the StartOfPacket signals of input port 1 and input port 2,
and EOP_1 and EOP_2 are the EndOfPacket signals of input port 1 and input port 2. It is
seen that input port 1 has the highest priority of the two input ports.
The state machine of the TLP-Multiplex is shown in Figure 4-6. In the idle state, the
priority encoder decides which incoming packet should be routed through. In the input
states, the packet is then routed through.
[State diagram. States: Idle, Input1 to Input6. From Idle, the SOP vector selects the
input state in priority order: SOP = "1-----" selects Input1, "01----" Input2,
"001---" Input3, "0001--" Input4, "00001-" Input5 and "000001" Input6. Each input
state N is held while EOPN = '0' and returns to Idle when EOPN = '1'.]
Figure 4-6: State diagram of state machine in the TLP-Multiplex.
The priority of the different input ports of the TLP-Multiplex is seen in Table 4-2.
Table 4-2: Priorities of input ports of the TLP-Multiplex.
Priority Name
1 PI interrupt
2 PI
3 SI
4 Control
5 Additional
6 Unsupported
The additional input port is for future use; it could serve another type of interface.
There is also an extra output port on the TLP-Switch.
With a priority encoder, there is a chance that a packet from a port with a low
priority will never be sent if the other interfaces keep blocking the TLP-Multiplex.
To be sure that no port is blocked continuously, a weighted round robin encoder could
be implemented instead of the priority encoder.
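The selection made in the idle state can be sketched as a fixed priority encoder over the six StartOfPacket signals. The following Python model is only an illustration of the principle; `priority_encode` and `PORT_NAMES` are names introduced here, with the port order taken from Table 4-2.

```python
from typing import Optional, Sequence

# Port order follows Table 4-2, highest priority first.
PORT_NAMES = ["PI interrupt", "PI", "SI", "Control", "Additional", "Unsupported"]


def priority_encode(sop: Sequence[bool]) -> Optional[int]:
    """Return the index of the highest-priority requesting port, or None."""
    for index, requesting in enumerate(sop):
        if requesting:
            return index
    return None


# The PI port and the Unsupported port request at the same time.
granted = priority_encode([False, True, False, False, False, True])
print(PORT_NAMES[granted])

# As long as a higher-priority port requests in every arbitration round,
# the Unsupported port is never granted.
grants = [priority_encode([False, True, False, False, False, True]) for _ in range(4)]
print(grants)
```

The second part of the example shows the starvation risk mentioned above: the lowest-priority port is only served in a round where no other port requests.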
4.2.4 PI-module (RBS)
The requirements for the PI-module are the ability to communicate with 3 chips, and
that the PCIe-Bridge should be fast enough that the parallel interface remains the
restricting factor with regard to bandwidth.
On the chosen FPGA there are not enough I/O pins for 3 independent parallel interfaces.
An FPGA with enough pins for three independent parallel interfaces would be too
expensive for this solution to be feasible. The 3 Vitesse chips must therefore be
connected to a shared bus with 3 different chip-select pins. Sharing the bus means that
the PCIe-Bridge cannot fully utilize all 3 parallel interfaces at once, so as a
compromise the PCIe-Bridge should be fast enough to fully utilize the shared bus. The
shared bus has the capability of having a posted read in progress on each of the chips
connected to the shared bus. In this section we will only distinguish between posted
read operations and all other operations, denoted non-posted reads. All non-posted
reads have the same properties concerning access time.
The maximum performance of the shared bus depends on the percentage of posted reads
and the number of posted reads in progress at a time. In Figure 4-7 the maximum
bandwidth of the shared bus is displayed, with 1 and 3 posted reads in progress. In the
same figure, the maximum bandwidth of 3 separate parallel interfaces is displayed to
show the cost of the compromise of choosing a shared bus over 3 separate parallel
interfaces. These curves have been calculated in the same way as described in section
2.2.1, with 3 posted reads in progress at a time.
Figure 4-7: The maximum bandwidth of the shared-bus when connected to 3 16-bit
parallel interfaces.
Figure 4-7 shows that the bandwidth of the shared bus with 3 posted reads in progress
is about twice as high as the bandwidth of the shared bus with 1 posted read in
progress, at 50 % slow reads. The calculated maximum bandwidth of the shared bus is
30.77 MB/s.
In the performance calculations behind the graphs in Figure 4-7 it is assumed that the
posted reads are equally distributed across the 3 switch chips. This assumption will
not hold in the real application. The maximum bandwidth will decrease the more
unevenly the posted reads are distributed.
Averaging over time will smooth out the distribution of the posted reads. This can be
done by directing the posted reads into 3 different queues, one for each Vitesse switch
chip. When executing from the 3 queues, they should have equal priority. The
distribution of the non-posted reads does affect the bandwidth of the parallel
interfaces connected to the shared bus, but it does not affect the bandwidth of the
shared bus. Therefore, all the non-posted reads will be directed into a single queue.
In a single queue the non-posted reads are executed in the same order as they were
requested, which helps to keep the latency low. The queues will be implemented as
FIFOs.
The PI-module has 4 inputs, one for each queue. The TLP-Switch decides which queue the
incoming TLP should be directed to, thereby avoiding the need to decode the TLP more
than once. Only one TLP can be completed at a time, so there will only be one output
of the PI-module.
Because the execution of a posted read is relatively slow compared to the non-posted
reads, we leave out the functionality of burst-posted-reads to reduce complexity.
Leaving out this functionality will increase the amount of data that must be
transferred through the PCIe interface. Because of the excess of bandwidth through the
PCI Express interface, this will not restrict the parallel interface.
The flowchart in Figure 4-8 shows the flow of which requests are executed, with the
possibility of 3 posted reads in progress at a time. It is seen in the flowchart that
there is no support for burst-posted-reads. This limitation should be taken care of in
software, because the PCI Express standard still allows burst reads even when the
target is a normal register.
[Flowchart. Steps: Initialize chips; Any slow reads done? (Complete slow read); for
each of Queue 1, Queue 2 and Queue 3: Queue empty? Slow read in progress? Execute from
the queue; Queue 4 empty? Execute from Queue 4.]
Figure 4-8: Flowchart for executing requests in the control state machine of the parallel
interface.
The execution of a request requires 3 tasks:
1. Selecting the request to execute.
2. Reading or writing data through the PI.
3. Sending the completion of the request.
The selection of the request is done by implementing the flowchart in Figure 4-8. The
read or write operation through the PI should proceed as described in section 2.2. The
completion of the request should follow the rules found in section 2.1.2.
The three tasks of the execution of a request should run in parallel to maximize the
utilization of the PI. Implementing these 3 tasks in 3 different state machines will
help keep the complexity low. The structure of the FIFOs and the state machines in the
PI-module is seen in Figure 4-9.
[Block diagram of the PI-module. Blocks: FIFO 1, FIFO 2 and FIFO 3 (slow reads), FIFO 4
(fast reads/writes), PI FSM Control, PI FSM Interface, PI FSM Compl, Output FIFO.
Signals: TLP, TLP header, Addr, Data, Req, Done, new_TLP, PI.]
Figure 4-9: Block diagram of the PI-module.
The control state machine requests a completion from the completion state machine
when it starts a read request on the interface state machine. The interface state
machine returns data and a data valid signal to the completion state machine and a
done signal to the control state machine.
To prevent the PI-module from stalling because it cannot transfer the data of the com-
pletion to the TLP-Multiplex, a FIFO is placed on the output of the completion state
machine.
When each QW of a completion is ready the QW is loaded into the output FIFO.
Between the output FIFO and the output port there is some logic for communicating
with the TLP-Multiplex. The logic asserts the StartOfPacket signal, and sends the pack-
et when the ready signal from the TLP-Multiplex goes high.
The state diagrams of the 3 state machines in the PI-module are shown in Figure 4-10,
Figure 4-11 and Figure 4-12.
[State diagram. States: Init, Header, Decision, transdata, transdata1, Stall, Wait for
Done. Transitions are conditioned on Empty, WR, QW_align, Done, Repeat, Data_left,
slw_done1 to slw_done3, empty_1 to empty_3 and usedw1 to usedw3.]
Figure 4-10: State diagram of the PI ctrl state machine.
If the control state machine executes a read, it goes to the Stall state, where it
waits for the interface state machine to be done with all the read operations. If the
control state machine executes a write, it goes to transdata or transdata1 depending on
the QW alignment. The states transdata and transdata1 supply the interface state
machine with data.
[State diagram. States: Idle, Addr, Chip_sel, Wait_st, Done_st. Transitions are
conditioned on Req, Repeat and PI_nDone.]
Figure 4-11: State diagram of the PI interface state machine.
The interface state machine follows the specified pattern in section 2.2 for communicat-
ing with the PI.
[State diagram. States: Idle, Last_head_qw, Wait_st. Transitions are conditioned on WR,
Qword_aligned, New_TLP, Input_valid, Bytes, lengthm1, lengthm2 and pi16n8.]
Figure 4-12: State diagram of the PI completion state machine.
The completion state machine makes a completion header and loads the data received
from the interface state machine into the output FIFO.
4.2.5 PI-IRQ-module (RBS)
Each switch chip has 2 interrupt pins that can be programmed to interrupt on different
settings. In the PCI Express standard INTx interrupt signaling there are four virtual
wires used for handling interrupts. We use all four virtual wires: two wires are
assigned to one of the switch chips and one wire to each of the two other switch chips.
Every time an interrupt pin changes state, the PI-IRQ-module sends a message TLP with
information on which pin changed and whether the pin was asserted or deasserted. For an
interrupt to be detected in the PI-IRQ-module, the interrupt pin must be high across a
rising edge of the internal clock in the PCIe-Bridge. In the PI-IRQ-module there are
two registers for each interrupt pin: one register is set high when the interrupt pin
is asserted, and the other register is set high when the interrupt pin is deasserted.
The registers are set low when the relevant TLP message is sent.
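The two registers per interrupt pin can be modelled as a pair of pending flags that are set on a change of the sampled pin and cleared when the message is sent. The Python class below is a behavioral sketch under that reading. The class and method names are invented for the example, and the choice to send a pending assert before a pending deassert is an assumption.

```python
class IrqPin:
    """Behavioral model of the two pending registers of one interrupt pin."""

    def __init__(self):
        self.previous_level = False    # pin level at the previous clock edge
        self.assert_pending = False    # set when the pin goes high
        self.deassert_pending = False  # set when the pin goes low

    def clock(self, level: bool) -> None:
        """Sample the pin on a rising edge of the internal clock."""
        if level and not self.previous_level:
            self.assert_pending = True
        elif not level and self.previous_level:
            self.deassert_pending = True
        self.previous_level = level

    def take_message(self):
        """Return the next message to send and clear its register."""
        if self.assert_pending:
            self.assert_pending = False
            return "assert"
        if self.deassert_pending:
            self.deassert_pending = False
            return "deassert"
        return None


pin = IrqPin()
pin.clock(True)
first = pin.take_message()
pin.clock(True)            # no change, so nothing new is pending
second = pin.take_message()
pin.clock(False)
third = pin.take_message()
print(first, second, third)
```

A pulse shorter than one clock period that does not cover a rising clock edge is never sampled by `clock`, which corresponds to the detection condition stated above.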
4.2.6 SI-module (JH)
The requirement for the serial interface (SI) is that it should be able to interface
to four different chips. Because of the restricted number of I/O pins on the FPGA, the
SI will, like the PI, not have an independent interface to each chip. Instead, each
chip has its own chip select signal, while the data-in, data-out and clock signals are
shared.
The maximum performance of the SI is reached when running continuous writes on the
SI with a serial clock frequency of 25 MHz. The wait period of a read is what makes it
slower than a write. For each transaction, four bytes are transferred over the SI. The
maximum bandwidth of the SI can be seen in Table 4-3.
Table 4-3: Maximum bandwidth for SI reads and writes
SI requests  Transaction time [ns]  Bandwidth [MB/s]
Write        2240                   1.78
Read         3240                   1.23
The SI must support burst reads and burst writes, and a write with the maximum payload
of 32 DWs would take 71.68 µs to finish when running at maximum performance. A FIFO
will be added to the input of the SI-module as stated in section 4.1. By doing this,
the TLP is brought into the FIFO and the PCIe-Bridge can carry on with other requests
while the write request is being executed.
The same problem will be experienced at the output if a maximum payload read request
is sent to the SI-module. A read request of 32 DWs would take 103.68 µs to finish. If
there is no FIFO at the output of the SI-module, the TLP-Multiplex would receive the
completion header and then wait while the SI performed the rest of the reads. This
would stall the PCIe-Bridge.
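The bandwidth and burst figures follow directly from the transaction times in Table 4-3. The arithmetic is shown below as a short Python sketch; the helper names are introduced here for illustration.

```python
BYTES_PER_TRANSACTION = 4  # four bytes are transferred per SI transaction
WRITE_NS = 2240            # transaction time of one write, from Table 4-3
READ_NS = 3240             # transaction time of one read, from Table 4-3
MAX_PAYLOAD_DWS = 32       # one DW is transferred per SI transaction


def bandwidth_mb_per_s(num_bytes: int, time_ns: float) -> float:
    """Bandwidth in MB/s; 1 byte/ns corresponds to 1000 MB/s."""
    return num_bytes / time_ns * 1000.0


def burst_time_us(num_dws: int, transaction_ns: float) -> float:
    """Time in microseconds for a burst of num_dws back-to-back transactions."""
    return num_dws * transaction_ns / 1000.0


write_bw = bandwidth_mb_per_s(BYTES_PER_TRANSACTION, WRITE_NS)
read_bw = bandwidth_mb_per_s(BYTES_PER_TRANSACTION, READ_NS)
write_burst = burst_time_us(MAX_PAYLOAD_DWS, WRITE_NS)
read_burst = burst_time_us(MAX_PAYLOAD_DWS, READ_NS)

print(f"write: {write_bw:.4f} MB/s, read: {read_bw:.4f} MB/s")
print(f"32 DW write: {write_burst:.2f} us, 32 DW read: {read_burst:.2f} us")
```

The write bandwidth evaluates to roughly 1.786 MB/s, which Table 4-3 lists as 1.78, and the two burst times reproduce the 71.68 µs and 103.68 µs quoted above.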
These observations indicate that an input FIFO is necessary. An output FIFO that can
hold a maximum payload TLP could also solve the problem on the output of the
SI-module. A second solution could be to utilize the byte count value of the
completion TLP header. This is done by sending the completion in multiple TLPs, with
the byte count indicating how many bytes remain to be read. Since the chosen FPGA has
enough area available for the FIFO implementation, that will be the chosen solution.
In contrast to the byte count solution, the FIFO solution only sends the completion
header once.
The FIFOs will have some logic connected, controlling the flow into the input FIFO and
out of the output FIFO. The control of the SI and the SI itself will lie in between these
two FIFOs as seen in Figure 4-13.
[Block diagram of the SI-module. Blocks: Input FIFO, SI FSM Input, SI FSM Control, SI
FSM Interface, SI FSM Output, Output FIFO. Signals: TLP, Addr, Data, Done, TLP Done,
SI.]
Figure 4-13: Block diagram of the SI-module.
By using this linear model, the SI requests will be executed in the order they are re-
ceived. Also, this block diagram is structured so that the SI, which relative to the PI is
slow, is isolated from the rest of the PCIe-Bridge. The requests are received, and ex-
ecuted by the SI FSM Control and the completion is sent to the output FIFO. When a
request has been handled, a done signal is sent to the SI FSM Output, and only then the
completion TLP is sent. Sending a completion will with this setup only take as many
clock cycles as there are QWs in the completion TLP. The SI FSM Input, SI FSM Con-
trol and SI FSM Output state machines are clocked using the standard 125 MHz clock,
while the SI FSM Interface state machine runs on a clock that is at best 25 MHz.
The logic at the input controls how the TLPs are put into the input FIFO. The two-state
FSM is seen in Figure 4-14.
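[State diagram. States: sop_i, rem_i. sop_i goes to rem_i when valid_in = '1' &
sop_in = '1' & full_i = '0', and is held when valid_in = '0' || sop_in = '0' ||
full_i = '1'. rem_i goes to sop_i when full_i = '0' & eop_in = '1', and is held when
full_i = '1' || eop_in = '0'.]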
Figure 4-14: State diagram of SI FSM Input.
If the input FIFO is not full, the input FSM waits for a StartOfPacket and a valid
signal from the TLP-Switch before it starts putting data into the queue. When an
EndOfPacket signal is received, the packet has been placed in the input FIFO, and the
input logic starts waiting for a new StartOfPacket from the TLP-Switch.
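The two-state input FSM can be written down as a next-state function using the transition conditions of Figure 4-14. The Python sketch below models only the state transitions. The function name is hypothetical, and FIFO write enables and other outputs are deliberately left out because they are not given in the text.

```python
def si_input_next_state(state: str, valid_in: bool, sop_in: bool,
                        eop_in: bool, full_i: bool) -> str:
    """Next state of the SI FSM Input for one clock cycle."""
    if state == "sop_i":
        # Wait for the start of a packet while the FIFO has room.
        if valid_in and sop_in and not full_i:
            return "rem_i"
        return "sop_i"
    if state == "rem_i":
        # Take in the remainder of the packet until its last QW is stored.
        if eop_in and not full_i:
            return "sop_i"
        return "rem_i"
    raise ValueError(f"unknown state: {state}")


state = "sop_i"
state = si_input_next_state(state, valid_in=True, sop_in=True, eop_in=False, full_i=False)
print(state)
state = si_input_next_state(state, valid_in=True, sop_in=False, eop_in=True, full_i=True)
print(state)
state = si_input_next_state(state, valid_in=True, sop_in=False, eop_in=True, full_i=False)
print(state)
```

In the second step the FIFO is full, so the FSM stays in rem_i even though EndOfPacket is present, and it only returns to sop_i once there is room again.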
As soon as the input FIFO is not empty, the SI FSM Control starts to handle the
request. The state diagram can be seen in Figure 4-15.
[State diagram. States: idle, header, header_2, get_data64q, interface, check_done,
done_write, done_read, read_next, write_next. Transitions are conditioned on empty_i,
full_o, b32_64, q_i(4), WR_s, done, check, addr0(2) and header_sent.]
Figure 4-15: State diagram for SI FSM control.
The SI FSM Control starts by fetching the TLP header from the input FIFO and saving it
to registers. Then it is determined whether it is necessary to get more data from the
input FIFO by checking if the address is QW aligned or the address length is 64 bits.
After each read or write, the SI FSM Control checks, according to the Avalon-ST rules,
whether the TLP request is done, whether new data has to be fetched from the input
FIFO, or whether everything is ready to start the SI one more time. In contrast to the
PI-module, the SI FSM Control both handles the request and sends a completion to the
output FIFO. In the PI-module, completions are handled by a separate state machine. A
separate state machine in the SI-module would eliminate the four states read_next,
write_next, done_read and done_write from the SI FSM Control. This would save only one,
or at best two, clock cycles per TLP request. The SI FSM Control runs on a 125 MHz
clock, while the SI FSM Interface runs on a clock with a maximum frequency of 25 MHz.
Therefore, the SI FSM Control will handle the read_next and write_next states before
the next rising edge in the SI FSM Interface. This means that at best 16 ns would be
saved. A single write request, which is the transaction taking the least time, takes
2240 ns, and relative to this, the extra 16 ns used by the SI FSM Control is accepted.
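The 16 ns figure corresponds to two cycles of the 125 MHz clock. The small Python calculation below puts it in relation to the shortest SI transaction; the variable names are chosen for the example.

```python
CONTROL_CLOCK_HZ = 125_000_000  # clock of the SI FSM Control
CYCLES_SAVED_AT_BEST = 2        # at best two clock cycles per TLP request
SINGLE_WRITE_NS = 2240          # shortest SI transaction, from Table 4-3

clock_period_ns = 1e9 / CONTROL_CLOCK_HZ
saved_ns = CYCLES_SAVED_AT_BEST * clock_period_ns
relative_overhead = saved_ns / SINGLE_WRITE_NS

print(f"period: {clock_period_ns} ns, saved at best: {saved_ns} ns")
print(f"relative to a single write: {relative_overhead:.2%}")
```

The possible saving is below one percent of a single write, which supports the decision to keep completion handling inside the SI FSM Control.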
When the SI FSM control reaches the interface state, an SI read or write is started. The
SI read or write is executed using the FSM seen in Figure 4-16.
[State diagram. States: idle, send_addr, write_st, wait_state, read_st, finish.
Transitions are conditioned on req, WR, addr_counter, data_counter and counter.]
Figure 4-16: State diagram of interface state machine in the SI-module.
The SI interface FSM is derived from the timing diagrams of the SI seen in section
2.3. After data has been read or written, a done signal is sent back to the SI FSM
Control. Because of the difference in clock frequency, the SI FSM Control waits for the
SI FSM Interface to go back to the idle state before continuing from the check_done
state.
When a TLP read or write request is done, the last QW of the completion TLP is put
into the output FIFO. The counter indicating how many whole TLPs are in the output
FIFO is incremented by one. When this counter is more than zero, the output logic as-
serts the StartOfPacket signal at the output in the sop_o state seen in Figure 4-17.
[State diagram. States: idle, sop_o, rem_o, done. Transitions are conditioned on
counter, ready_out and q_o(0).]
Figure 4-17: State diagram of output state machine in the SI-module.
The output FSM implements the process of sending TLPs to the TLP-Multiplex: after the
StartOfPacket signal has been asserted in the sop_o state, the remainder of the packet
is sent in the rem_o state. After this it goes back to check whether there is a whole
TLP in the output FIFO.
4.2.7 Control-module (RBS)
The Control-module controls GPIOs and internal control signals of the PCIe-Bridge. As
mentioned in section 2.4, there are 3 pins that should be controlled. Internally in the
PCIe-Bridge, the completer ID signal should be passed around to all modules that
return completions. The completer ID can be fetched from the PCIe-module through the
tl_cfg interface. This interface has an address output and a data output. The
Control-module should refresh the completer ID every time the address of the tl_cfg
interface reaches the address of the completer ID. The GPIOs should be controlled from
the external CPU, which is done by sending memory requests. The address space of the
Control-module is so small that there is no need for the Control-module to support
burst requests. The address space of the Control-module is seen in Table 4-4.
Table 4-4: Address space of the Control-module.
Address Signal
0000 EP_ID
0001 nRESET
0010 VCore_cfg0
0011 VCore_cfg1
Rest Reserved
The Control-module has an input FIFO on the input and an output FIFO on the output.
In between the two FIFOs there is a state machine for executing the requests. The state
diagram of the state machine controlling the execution of requests can be seen in
Figure 4-18.
[State diagram. States: Idle, Addr_st, Exec, Read_st, Compl, Error. Transitions are
conditioned on Sop, Eop, Wr, addr(2), Last_eop, Prev_last_eop, usedw_in and
usedw_out.]
Figure 4-18: State machine of the Control-module.
4.2.8 Unsup-module (JH)
If a TLP that is not of the type memory read or memory write is received at the
TLP-Switch, the request is not supported. Such packets are sent to the Unsup-module,
where the packet is first put into a FIFO. Then the transaction ID is fetched from the
header of the TLP, and from this a completion TLP is sent back through the
TLP-Multiplex with the completion status field set to unsupported request. The state
diagram for the FSM making the completion TLPs is seen in Figure 4-19.
[State diagram. States: idle, fin_compl, fin_req. Transitions are conditioned on sop,
eop, usedw_in(1) and usedw_out.]
Figure 4-19: State diagram for unsupported TLP module.
The idle and fin_compl states send the completion TLP to the output FIFO. The fin_req
state then removes the rest of the request from the input FIFO.
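How the transaction ID is carried over into the completion can be sketched at DW level. The Python function below follows the general PCI Express header layout as understood here (requester ID and tag in the second header DW of the request, completion status in the second DW of the completion). It is a simplified illustration and not the bridge's HDL: traffic class and attribute bits are left at zero, the byte count is fixed to 4 on the assumption that only non-memory requests reach this module, and the function name is hypothetical.

```python
CPL_FMT_TYPE = 0x0A            # completion without data (Fmt = 00, Type = 01010)
STATUS_UNSUPPORTED_REQUEST = 0b001


def unsupported_completion(request_dw1: int, completer_id: int) -> list:
    """Build the three header DWs of a completion with status UR.

    request_dw1 is the second header DW of the request, which holds the
    transaction ID: requester ID in bits [31:16] and tag in bits [15:8].
    """
    requester_id = (request_dw1 >> 16) & 0xFFFF
    tag = (request_dw1 >> 8) & 0xFF
    byte_count = 4  # assumption: request is not a memory read
    dw0 = CPL_FMT_TYPE << 24
    dw1 = (completer_id << 16) | (STATUS_UNSUPPORTED_REQUEST << 13) | byte_count
    dw2 = (requester_id << 16) | (tag << 8)
    return [dw0, dw1, dw2]


header = unsupported_completion(request_dw1=0x1234560F, completer_id=0x0100)
print([f"{dw:08X}" for dw in header])
```

The requester ID 0x1234 and tag 0x56 of the request reappear in the last DW of the completion, which is what lets the requester match the completion to its request.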
4.3 Address space of the PCIe-Bridge (JH)
The PI of the Jaguar switch has an address length of 24 bits. The address space of the
PCIe-Bridge needs to have room for three PI devices with 24-bit addresses. In addition,
there must also be room for four SI components with a 23-bit address space each. This
means that the minimum size of the address space for the PCIe-Bridge is 27 bits.
The TLP-Switch uses address-based routing of packets. The PCIe-Bridge endpoint must
be set to have a 27-bit address space. Table 4-5 shows how the address space is utilized
in the PCIe-Bridge application and how it is determined where the packets are routed.
Table 4-5: Address space of the PCIe-Bridge.
Addr[26:24]  Following address bits   Address space for target device (chip select to device)
000                                   PI device 1 (PI_nCS0)
001                                   PI device 2 (PI_nCS_SLAVE)
010                                   PI device 3 (PI_nCS_FPGA)
011          0                        SI device 1 (SI_nCS0)
011          1                        SI device 2 (SI_nCS_SLAVE)
100          0                        SI device 3 (SI_nCS_FPGA)
100          1                        SI device 4 (SI_nCS_FLASH)
101          000000000000000000000    The Control-module
101          -------------------      Reserved
110                                   Reserved
111                                   Reserved
The three most significant bits determine which module or component is targeted. In the
case of SI the fourth most significant bit also determines which of the SI components is
targeted.
The address space is not fully occupied, so additional functionality that requires
address space can be added.
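The decoding described by Table 4-5 can be sketched in software as follows. This is an illustrative model, not the VHDL of the TLP Switch; the function name is ours, and we assume that the 21 zero bits shown for the Control-module are Addr[23:3].

```python
def decode_target(addr: int) -> str:
    """Map a 27-bit bridge address to a target name following Table 4-5."""
    if not 0 <= addr < 2**27:
        raise ValueError("address outside the 27-bit address space")
    top = (addr >> 24) & 0b111   # Addr[26:24] selects the module
    bit23 = (addr >> 23) & 1     # Addr[23] selects between two SI devices
    if top <= 0b010:
        return ("PI_nCS0", "PI_nCS_SLAVE", "PI_nCS_FPGA")[top]
    if top == 0b011:
        return ("SI_nCS0", "SI_nCS_SLAVE")[bit23]
    if top == 0b100:
        return ("SI_nCS_FPGA", "SI_nCS_FLASH")[bit23]
    if top == 0b101:
        # Assumption: the 21 zero bits in Table 4-5 are Addr[23:3].
        if (addr >> 3) & (2**21 - 1) == 0:
            return "Control-module"
        return "Reserved"
    return "Reserved"  # prefixes 110 and 111
```

For example, `decode_target(0x3800000)` has the prefix 011 and Addr[23] = 1, so it selects SI device 2 (SI_nCS_SLAVE).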
4.4 Expected performance (RBS)
The worst case for the PCI Express bandwidth is when all requests are single DW writes
with 64-bit addressing. Such a request takes up 20 bytes and contains 4 bytes of data. In
this case the efficiency of the PCI Express is 20 %, giving a maximum bandwidth of
40 MB/s.
To get an idea of the expected performance of the PCIe-Bridge, only the PI-module and
the SI-module need to be taken into account. The other modules are used for initializing
the PCIe-Bridge and very little during runtime.
The maximum bandwidth of the PI, seen in Figure 4-7, plus the maximum bandwidth of the
SI, seen in Table 4-3, is 32.55 MB/s.
Comparing the performance of the PI and SI to the bandwidth of PCI Express, we see that
even if none of the requests are burst requests, the PI and SI are still the limiting
factor.
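The comparison can be reproduced from the numbers quoted above. The sketch below uses only those numbers; the variable names are ours, and the underlying link rate is derived from the quoted 40 MB/s and 20 % rather than taken from the text.

```python
REQUEST_BYTES = 20            # single-DW write with 64-bit addressing
PAYLOAD_BYTES = 4             # one DW of data
PCIE_WORST_CASE_MBPS = 40.0   # worst-case PCI Express bandwidth quoted above
PI_PLUS_SI_MBPS = 32.55       # combined PI and SI maximum quoted above

efficiency = PAYLOAD_BYTES / REQUEST_BYTES
# Derived value: the payload rate at 100 % efficiency that the two quoted
# figures imply. It is not stated in this section.
implied_link_rate = PCIE_WORST_CASE_MBPS / efficiency
headroom = PCIE_WORST_CASE_MBPS - PI_PLUS_SI_MBPS

print(f"efficiency        : {efficiency:.0%}")
print(f"implied link rate : {implied_link_rate:.0f} MB/s")
print(f"headroom          : {headroom:.2f} MB/s")
```

The positive headroom is what makes the PI and SI the limiting factor even in the worst case for PCI Express.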
5 IMPLEMENTATION OF THE PCIE-BRIDGE
5.1 Implementation of HDL (RBS)
The components in the structural design have been simulated along with their
implementation as far as possible (see section 6 for information on the simulation setup).
The first goal of the implementation was to get the routing of packets to work. To verify
that the routing of packets through the system worked as expected, we implemented a dummy
interface to emulate the real blocks connected to the interfaces. This dummy interface
was placed between the TLP-Switch and the multiplexer. Its only function is to pass the
packets from its single input to its single output. This setup can be seen in Figure 5-1.
Figure 5-1: Structural block diagram of PCIe-Bridge for testing routing of packets.
(Diagram not reproduced. Blocks shown inside the PCI Express Bridge (PCIe-Bridge): the
Hard IP PCI Express module (PCIe-module), the TLP switch (TLP-Switch), four Dummy
Interfaces (Dummy-module) and the TLP multiplexer (TLP-Multiplex); one connection is
marked "Not connected".)
With the dummy interface in place, we tested that incoming packets could be looped back
to the PCIe-module.
With the routing of the packets working properly, the next step was to implement each
control unit. The control units were substituted into the testbench separately and
tested, one at a time.
We used Quartus II for implementation and ModelSim for simulation of our project.
Quartus II has an add-on called SOPC Builder. It was our intention to use SOPC Builder
to make the structural implementation, and then write the control modules ourselves in
VHDL. It turned out that SOPC Builder is not very efficient for designs containing
user-defined components. We abandoned the idea of using SOPC Builder, and instead used
the MegaWizard in Quartus II to configure the PCIe-module and all the FIFOs used in our
design. See Appendix D for the source code we have written ourselves. See Table 5-1 for
the source files associated with the different modules of the design.
Table 5-1: Filenames of the implemented modules.
Modules                          Files
Top level entity                 PCIe_bridge.vhd
Top level entity of simulation   PCIe_testbench.vhd
PI-IRQ-module                    PI_interrupt.vhd
PI-module                        PI_control.vhd
                                 PI_FSM_Ctrl.vhd
                                 PI_FSM_Interface.vhd
                                 PI_FSM_Compl.vhd
                                 PI.vhd
                                 VSC74xx_PI.vhd
SI-module                        SI_module.vhd
                                 SI_control.vhd
                                 SI_interface.vhd
                                 SI_input_FIFO_logic.vhd
                                 SI_output_FIFO_logic.vhd
                                 CLK_DIVIDER.vhd
                                 SPI.vhd
                                 VSC74xx_SPI.vhd
Control-module                   control_reg.vhd
Unsup-module                     unsupported_tlp.vhd
TLP-Switch                       tlp_switch.vhd
TLP-Multiplex                    tlp_miltiplex.vhd
PCIe-module for simulation       PCIe_module_testbench.vhd
5.2 Design of interface board (JH)
In order to verify the design of the PCIe-Bridge, some real-life testing needs to be
done. For the Cyclone IV GX FPGA, there is no evaluation kit available that makes it
possible to test the designed PCIe-Bridge against the Luton26 or the Jaguar. Therefore, a
PCIe-Bridge interface board will be designed for the purpose. The interface board will
be made as an add-on board to the reference boards made by Vitesse for the Luton26 and
the Jaguar chips. The reference boards have connectors on which the parallel and the
serial interface signals of the chips are accessible. The interface board designed for
the PCIe-Bridge will sit on top of one of these connectors. The MPLS module can be
connected using a cable. When the interface board sits on top of a Luton26 or Jaguar
reference board, the PCI Express will have to connect to the external CPU using a PCI
Express cable.
The steps involved in designing an interface board for the PCIe-Bridge are:
1. Determining what components are necessary for the PCIe-Bridge interface board to
   function properly.
2. Making the schematics, where the components are connected.
3. Making the layout of the board.
5.2.1 Components
To determine what components are necessary on the interface board, we look at what the
FPGA needs in order to function properly.
The FPGA needs to be programmed on power-up. This requires either a connector to program
the FPGA, or a serial configuration device (EPROM). Connectors for the reference boards
and one for PCI Express also need to be featured. In addition, the EPROM needs to be
configured, and a connector is needed for this purpose.
Overall, the necessary components for a PCIe-Bridge interface board may be listed as
follows:
- FPGA: The Cyclone IV GX FPGA in which the PCIe-Bridge has been implemented.
- EPROM: To program the FPGA on power-up, so that no external system has to stand by at
  every power-up.
- JTAG connector: For direct programming of the FPGA.
- EPROM programming connector: For configuration of EPROM devices.
- PCI Express connector: Connector to which the PCI Express can connect.
- External oscillator: A 125 MHz clock generator.
- Reference board connectors: A connector for a Jaguar reference board, another for the
  Luton26 reference board and a third for the MPLS FPGA module.
- Power supplies: DC/DC converters and linear regulators to generate the different
  supply voltages required by the FPGA.
- Supply decoupling: Capacitors to decouple and shunt electrical noise from the FPGA.
For the JTAG and Altera EPROM connectors, Vitesse have developed an expansion
board for their MPLS board that contains these connectors. This expansion board will
also be used for the PCIe-Bridge interface board to save area, so instead of a JTAG and
an EPROM programming connector, a connector for this expansion board is featured.
5.2.2 Schematics
The schematics connect the different components together. An overview of the PCIe-Bridge
interface board is seen in Figure 5-2.
Figure 5-2: Overview of components on the PCIe-Bridge interface board. (Diagram not
reproduced. Blocks shown: Cyclone IV FPGA (EP4CGX15), EPROM (EPCS4), oscillator, power
supplies and decoupling capacitors, connector for the JTAG and ByteBlaster II board,
PCI Express connector, and connectors for the Luton26, Jaguar and MPLS boards.)
The FPGA used in the design of the PCIe-Bridge is the smallest in the Cyclone IV GX-
series. The chosen FPGA comes in two different packages. The packages are a 169-pin
Fine Ball Grid Array (FBGA) and a 144-pin Quad Flat No leads (QFN). The FBGA
package uses more area on the PCB than the QFN package, and the FBGA package is
also more expensive than the QFN package. These two factors favor the QFN package,
but it is only the smallest FPGA in the Cyclone IV GX series that comes in the 144-pin
QFN package. This removes the possibility of upgrading the number of logic cells on
the same PCB. The three smallest Cyclone IV GX FPGAs come in the 169-pin FBGA
package. Because of this, the 169-pin FBGA is the package of choice for this add-on
board.
5.2.2.1 Assigning FPGA pins
On the FPGA, the IO signal pins are split into IO banks. When assigning IO signals to
pins, it is important that signals that depend on each other are assigned to IO pins in
the same IO bank as far as possible. This makes it easier to meet the timing constraints
of the design [5]. Because of this, the serial interface, general purpose IOs and
possible additions are placed in one single IO bank, and the parallel interface occupies
the rest of the available IO pins, using up all the other available IO banks.
How the pins are assigned according to each IO bank can be seen in Appendix B.2.
5.2.2.2 EPROM and Connectors
It now has to be determined more specifically:
- What EPROM is needed for programming the FPGA
- What connectors are needed, and where they are placed on the PCB
Altera recommend using their own EPROMs with Altera FPGAs. Since this add-on
board will be used mostly for testing, and not be produced in high volume, the extra
price that this may cause will be overlooked. The chosen EPROM is therefore the Altera
EPCS4. This is the smallest EPROM that supports the chosen FPGA [9]. The Altera
EPCS4 is configured through an Altera ByteBlaster II connector.
The connector for the Luton26 board consists of a 2x25-pin male pin row, and the Jaguar
connector consists of two 2x25-pin female receptacle sockets. These two connectors will
be placed on the bottom side of the board. The add-on board will then be able to sit on
top of the connectors on the Vitesse reference boards.
The MPLS module will be connected using a cable. The connector needed for that cable
consists of two 2x25-pin male pin rows. These pin rows will be placed on the top side of
the board so that they are accessible by cable. The MPLS connector will be placed on the
top of the PCB where the Jaguar connector is placed on the bottom. This is done so that
signals that connect to equivalent pins on the MPLS connector and the Jaguar connector
can be routed straight through the board. This makes the board easier to lay out.
The expansion board featuring JTAG and Altera ByteBlaster II connectors is also
connected through a 2x25-pin pin row, which is placed on the top side of the PCB, at the
same location where the Luton26 connector sits on the bottom side.
The PCI Express connector is from Molex. It is an 18-pin PCIe x1 cable connector, an
angled connector placed on the edge of the top side of the PCB. An overview of how the
connectors and the other components are placed on the PCB can be seen in Figure 5-3.
5.2.2.3 Power Supply
The power supply for the interface board is received either through the Luton26
connector or through the Jaguar connector. The interface board supply voltage can be
seen in Table 5-2.
Table 5-2: The supply voltage for the PCIe-Bridge interface board, for each of the
reference boards.
Vitesse chip ref. board   Supply voltage [V]
Jaguar                    3.3
Luton26                   2.5
                          3.3
The Luton26 reference board with a supply voltage of 2.5 V has not been developed yet,
but the possibility must be considered for future applications.
There are three components requiring power supplies. An overview of what these
components demand is listed in Table 5-3.
Table 5-3: Needed supply voltages by components
Component    Part of component   Required voltage [V]
EPROM        Core                3.3
Oscillator   Core                3.3
FPGA         Core                1.2
             PCIe Hard IP        2.5
             IO banks            PCIe-Bridge supply voltage
Table 5-3 shows that three voltage levels are required for the components on the
PCIe-Bridge interface board. When the input supply voltage is 3.3 V, the 2.5 V supply is
made using a linear regulator. A linear regulator is not an efficient way to convert
voltage levels, but since the current supplied to the PCIe Hard IP is low (Appendix B.4)
and the voltage drop is relatively small, the loss in the linear regulator is acceptably
low [10]. When the interface board supply voltage is 2.5 V, the linear regulator is
bypassed.
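The trade-off can be illustrated with the standard first-order model of a linear regulator, where the voltage drop times the load current is dissipated as heat and the quiescent current is neglected. The sketch below is ours; the load current in the usage line is a hypothetical placeholder, since the actual value is only given in Appendix B.4.

```python
def linear_regulator(v_in: float, v_out: float, i_load: float):
    """Return (efficiency, power loss in W) of an ideal linear regulator.

    First-order model: the quiescent current of the regulator is neglected.
    """
    if v_out > v_in:
        raise ValueError("a linear regulator can only step the voltage down")
    efficiency = v_out / v_in
    p_loss = (v_in - v_out) * i_load
    return efficiency, p_loss


# Hypothetical load current of 0.1 A, chosen for illustration only.
eff, loss = linear_regulator(v_in=3.3, v_out=2.5, i_load=0.1)
print(f"efficiency {eff:.1%}, loss {loss * 1000:.0f} mW")
```

In this model the efficiency depends only on the voltage ratio, while the absolute loss scales with the load current, which is why a low-current rail with a small drop is a reasonable fit for a linear regulator.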
The 1.2 V supply is generated using a DC/DC step-down converter. The DC/DC converter is
used because it is more efficient than the linear regulator, and because the core of the
FPGA consumes more power than the PCIe Hard IP (Appendix B.4), which makes an efficient
power supply more important.
When the supply voltage for the interface board is 2.5 V, a DC/DC step-up converter is
used to supply the components requiring 3.3 V.
To ensure that electrical noise that may come from the FPGA does not affect the whole
system, decoupling capacitors are needed on the supplies. The number of capacitors
needed was determined by Vitesse staff. The decoupling and the power supplies can be
seen in the schematics in Appendix B.1.
Overall, the PCIe-Bridge interface board will look something like that depicted in
Figure 5-3.
Figure 5-3: An overview of how the components will be placed on the PCB. (Diagram not
reproduced. It shows a top and a bottom view of the board with the following labels:
FPGA, EPROM, OSC, PCIe, two male MPLS connectors, female expansion board connector,
linear regulator, 1.2 V supply, 3.3 V supply, two female Jaguar connectors, male Luton26
connector, and area available for decoupling capacitors.)
The areas of the components in Figure 5-3 are drawn to scale relative to each other.
The spacing between the two connectors making up the Jaguar connector also matches that
of the Jaguar reference board. The actual size of the PCB is not fixed, though; it may
change during layout, where area may be saved.
The total cost of this interface board can be seen in Appendix B.3.
5.2.3 Layout
The layout for the PCIe-Bridge interface board has not been made. This task was to be
carried out by engineers at Vitesse, but since the Cyclone IV FPGA could not be delivered
in time for testing and verification, it was decided not to make the layout or order the
board, in order to meet the deadline of the project.
A customer opting for the PCIe-Bridge interface board therefore needs to lay out the
board before being able to use it.
When produced, this PCIe-Bridge interface board can be used to test and verify the
implemented functionality of the PCIe-Bridge.
6 TESTING AND VERIFICATION
6.1 Simulation (RBS)
As explained in section 5.1, the simulation has been done in parallel with the
implementation, and the testbench has been extended continuously as the need for more
testing functionality arose.
6.1.1 Structure of testbench
When the PCIe-module is instantiated in MegaWizard, a testbench is included. This
testbench works by sending chaining DMA requests through the PCI Express interface. We
have tested the PCIe-module with the included testbench to make sure that the PCI Express
module works. There are two problems with this testbench. First, there is no
documentation of how to alter the requests that are sent. Second, it is a very large
testbench, which takes a long time to simulate. Instead, we have made our own testbench
that does not communicate through the PCI Express but through the Avalon-ST interface.
This simplifies the testbench and enables us to control the packets.
In the top module of the testbench, the PCIe-Bridge and the non-synthesizable models of
the parallel interface and serial interface are instantiated. Inside the PCIe-Bridge,
the auto-generated PCIe-module is replaced with a PCIe-module_testbench. The block
diagram of the testbench is shown in Figure 6-1.
Figure 6-1: Block diagram of testbench. (Diagram not reproduced. Blocks shown:
PCIe_testbench.vhd, SPI.vhd with four VSC74xx_SPI.vhd blocks, PI.vhd with three
VSC74xx_PI.vhd blocks, PCIe_Bridge.vhd with PI_control.vhd, SI_control.vhd and
PCIe_module_testbench.vhd, an input file and an output file.)
The PCIe-module_testbench has the same inputs and outputs as the PCIe-module. The only
signals used by the PCIe-module_testbench are the 64-bit Avalon-ST input and output; all
other signals are left open or set to zero. Inside the PCIe-module_testbench, there is a
file reader and a file writer. The file reader reads the input file, and each line of
the input file is driven onto the Rx-port for one clock cycle. The file writer writes
the data from the Tx-port to a file. The structure of the packets in the test files can
be seen in Appendix C.1.
6.1.2 Methodology of testing
Two different test methods are used to test the PCIe-Bridge. The first test method is for
testing the routing of packets through the PCIe-Bridge. The second test method is used
to test the functionality of each module of the PCIe-Bridge.
To test the routing of packets, the setup of the PCIe-Bridge is as seen in Figure 5-1. In
this test, TLPs are sent through the PCIe-Bridge. To verify a test of this type, the output
packets are compared to the input packets, and the route of each TLP is compared to the
expected route. In a successful test the input and output packets should be identical, and
arrive at the output in the same order as they were transmitted from the input.
When testing the functionality of each module in the PCIe-Bridge, the setup of the
PCIe-Bridge should be as in Figure 4-1. To verify the test, the input packets are
compared to the output packets. In a successful test, every request sent through the
input should return a correct completion at the output. The PCI Express interface
supports split transactions and each packet has its own tag; therefore, it does not
matter in which order the completions are received at the output.
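The two pass criteria can be sketched as follows: the routing test requires identical packets in identical order, while the module test only requires that every expected completion is present, matched by tag. The representation of a completion as a dictionary with the keys `tag` and `data` is a hypothetical simplification of ours and is not the test file format of Appendix C.

```python
def routed_in_order(sent, received) -> bool:
    """Routing test: the same packets must arrive in the same order."""
    return list(sent) == list(received)


def completions_match(expected, received) -> bool:
    """Module test: compare completions by tag, ignoring arrival order."""
    exp_by_tag = {c["tag"]: c for c in expected}
    rec_by_tag = {c["tag"]: c for c in received}
    # Duplicate tags would silently collapse in the dicts, so reject them.
    if len(exp_by_tag) != len(expected) or len(rec_by_tag) != len(received):
        return False
    return exp_by_tag == rec_by_tag


expected = [{"tag": 1, "data": 0xAA}, {"tag": 2, "data": 0xBB}]
received = [{"tag": 2, "data": 0xBB}, {"tag": 1, "data": 0xAA}]
print(routed_in_order(expected, received))    # order differs
print(completions_match(expected, received))  # same set of completions
```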
All test files have been listed in Appendix C together with their expected output
files. With this approach, it is easier to make changes to the VHDL code and check that
no unexpected behavior has been introduced to the PCIe-Bridge.
6.1.3 Verifying packet routing
It is only possible to test 6 of the output ports of the TLP-Switch because the
TLP-Multiplex only has 6 input ports. To verify that the routing of packets is done
correctly, we use the testing method described in section 6.1.2. Furthermore, the
routing should be tested under maximum load, without breaks between the packets. The
packets should be directed out of the TLP-Switch ports 1 – 6, in incrementing order. The
binary files from the performed tests can be seen in Appendix C.2-C.3. The result showed
that the input and output packets were identical and arrived at the output in the same
order as they were transmitted from the input. A wave diagram of the StartOfPacket
signals of the TLP-Switch output ports can be seen in Figure 6-2.
Figure 6-2: StartOfPacket signals of TLP-Switch, while testing routing of packets.
In the wave diagram it is seen that the StartOfPacket signals go high in the right order.
This test was performed under maximum load with no spare clock cycles in between
packets. It is hereby verified that the packets are routed correctly through the system.
6.1.4 Verifying functionality of modules
For testing and verifying the functionality of the different modules, we use the testing
method described in section 6.1.2. The testable functionality of the PCIe-Bridge is
listed in Table 6-1.
Table 6-1: Testable functionality of modules in the PCIe-Bridge.
Module name      Testable functionality
PI-IRQ-module    Assertion/deassertion of each interrupt pin, multiple at a time.
PI-module        Read/write & burst/non-burst & QW/non-QW aligned address &
                 posted reads to all switch chips at the same time
SI-module        Read/write & burst/non-burst & QW/non-QW aligned address &
                 from all the different chips
Control-module   Read/write
Unsup-module     Return unsupported request completion
In the following sections, we go through the testing and verification of the different
modules.
6.1.4.1 PI-IRQ-module
The switch chips drive the interrupt pins, so packets are only sent from the module and
never to it. The interrupt pins are asserted and deasserted in the simulation models of
the 3 parallel interfaces in the testbench. To test the functionality of the
PI-IRQ-module, the interrupt pins are asserted and deasserted, first one at a time and
afterwards several at a time. The interrupt pulses are 10 ns wide. Every assertion and
deassertion should result in the PI-IRQ-module sending a message TLP. These message TLPs
can be accounted for in the output file. Figure 6-3 shows a timing diagram of the
interrupt pins in the test of the PI-IRQ-module.
Figure 6-3: Timing diagram of interrupt signals, when testing PI-IRQ-module.
First, each of the interrupt pins is asserted and deasserted one at a time, and
afterwards all the interrupt pins are asserted at once and then deasserted at once.
Table 6-2 shows the message codes of the received message TLPs. The files from this test
can be seen in Appendix C.4-C.5.
Table 6-2: Results of PI-IRQ-module test.
Message code Message name
0x23 Assert_INTD
0x21 Assert_INTB
0x20 Assert_INTA
0x22 Assert_INTC
0x24 Deassert_INTA
0x25 Deassert_INTB
0x26 Deassert_INTC
0x27 Deassert_INTD
0x20 Assert_INTA
0x21 Assert_INTB
0x22 Assert_INTC
0x23 Assert_INTD
0x24 Deassert_INTA
0x25 Deassert_INTB
0x26 Deassert_INTC
0x27 Deassert_INTD
It is seen in the results that all the message TLPs arrive at the output. It is hereby
verified that the PI-IRQ-module is working properly.
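The message codes in Table 6-2 follow a simple pattern: the assert codes start at 0x20 and the deassert codes at 0x24, each offset by the interrupt line A to D. The helper functions below are our own illustration of that pattern and can be used when checking an output file by hand.

```python
ASSERT_BASE = 0x20     # Assert_INTA
DEASSERT_BASE = 0x24   # Deassert_INTA
LINES = "ABCD"


def message_code(line: str, asserted: bool) -> int:
    """Message code for asserting or deasserting interrupt line A..D."""
    base = ASSERT_BASE if asserted else DEASSERT_BASE
    return base + LINES.index(line)


def message_name(code: int) -> str:
    """Inverse mapping, e.g. 0x26 -> 'Deassert_INTC'."""
    if not ASSERT_BASE <= code < DEASSERT_BASE + len(LINES):
        raise ValueError(f"not an INTx message code: {code:#04x}")
    kind = "Assert" if code < DEASSERT_BASE else "Deassert"
    return f"{kind}_INT{LINES[(code - ASSERT_BASE) % len(LINES)]}"


for code in (0x23, 0x21, 0x20, 0x22, 0x24, 0x25, 0x26, 0x27):
    print(f"{code:#04x} {message_name(code)}")
```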
6.1.4.2 PI-module
The PI-module should be able to execute requests from the 4 different queues and it
should be possible to have a posted read in progress on each of the 3 chips at the same
time. The non-posted reads should support burst requests. To ensure that the data is
aligned correctly, all of these requests should be tested with and without a QW-aligned
address.
The data returned by the PI is the address shifted 1 to the right. This enables us to check
if the address is incremented correctly. Through an 8-bit parallel interface, the data from
every other read should increment by 1. Through a 16-bit parallel interface, the data
from every read should increment by 1.
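The expected read data for a burst can be modelled as below. This is a sketch of the check only; the function name is ours, and we assume byte addressing, so that the address advances by one per read on an 8-bit interface and by two per read on a 16-bit interface.

```python
def expected_read_data(start_addr: int, n_reads: int, bus_width_bits: int):
    """Data the simulated PI should return for n_reads consecutive reads.

    The PI model returns the address shifted one bit to the right.
    Assumption: the address advances by the bus width in bytes per read.
    """
    if bus_width_bits not in (8, 16):
        raise ValueError("the parallel interface is 8 or 16 bits wide")
    step = bus_width_bits // 8
    return [(start_addr + i * step) >> 1 for i in range(n_reads)]


print(expected_read_data(0, 4, 8))    # increments on every other read
print(expected_read_data(0, 4, 16))   # increments on every read
```

A read sequence that skips or repeats an address shows up directly as a break in these patterns.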
All tests should be run with both 8-bit and 16-bit parallel interfaces. The binary test
files for these tests are shown in Appendix C.6-C.8. All requests returned a correct
completion. It is thereby verified that the PI-module can execute the requests that it
should support. To verify that the PI-module has the performance characteristics that it
was designed to have, another test is needed. We ran a series of tests with different
compositions of posted reads and non-posted reads; the test files for these tests can be
seen in Appendix C.9-C.15. In Figure 6-4, the results of these tests are plotted
together with the graph of the maximum bandwidth for a shared bus with 3 posted requests
in progress at a time.
Figure 6-4: Measured bandwidth of the implemented PI-module, as a function of posted
reads, when all 3 PIs are 16 bits wide.
In the performance test, the posted reads are distributed equally over the chips, to
get the maximum bandwidth of the PI-module. The measured values are a little lower than
the theoretical values. This is because the test only executes around 10 requests; the
measured values would be closer to the theoretical values if more requests were
executed. We see in Figure 6-4 that the measured values follow the same trend as the
theoretical values. It is hereby verified that the PI-module is working properly.
6.1.4.3 SI-module
The SI-module should support the same requests as the PI-module, except for posted
reads. The test files used to test the SI-module are seen in Appendix C.16-C.17.
All the requests have a matching and correct completion. It is hereby verified that the
SI-module is working properly.
6.1.4.4 Control-module
To test the Control-module, the input file should contain a write TLP and a read TLP
to the same control register. In a successful test, the read TLP should return the data
just written by the write TLP. A limitation of this module is that it does not support
burst reads and writes. To ensure that the module does not need a reset after a burst
read or write has been sent, the test described above should be repeated after a burst
read and a burst write have been sent. The test result is seen in Appendix C.18-C.19.
For all the supported requests there was a correct completion with the expected data.
It is hereby verified that the Control-module is working properly.
6.1.4.5 Unsup-module
The Unsup-module should catch all TLPs with types other than memory requests. For each
caught TLP, the Unsup-module should return an “unsupported request” completion. To test
the Unsup-module, TLPs of all possible types are sent, and for all the unsupported
requests a correct completion should be returned. See Appendix C.20-C.21 for the input
and output files of this test. All the completions have the completion status
unsupported request, and the requester ID and completer ID are set correctly. It is
hereby verified that the Unsup-module is working properly.
6.2 Synthesis
6.2.1 Hardware usage (JH)
The synthesis report gives an overview of the total amount of hardware used on the
FPGA to implement the PCIe-Bridge. To make sure that the hardware usage is as expected,
the components must be investigated to see whether the usage found in the synthesis
report matches what is expected from the HDL implementation.
The total hardware usage of the design reported by the synthesis is seen in Table 6-3.
Table 6-3: Total hardware usage of the PCIe-Bridge design on FPGA
Resources Usage
LC combinationals 4175
LC registers 2943
Memory bits 55136
Pins 64
An LC register is a 1-bit register. The FPGA has 14400 logic cells (LC) and 552960
memory bits, so the designed PCIe-Bridge fits in the FPGA with regard to area.
Of the pins, 60 are accounted for in Table 3-3. The PCIe interface accounts for 3 pins,
and the serial interface has two output pins for the serial clock, increasing the number
of pins used by the serial interface by one. With this, all 64 pins are accounted for.
The only components using the memory bits are the FIFOs. Not all of the FIFOs are
implemented in block RAM, though; some are made using registers to increase
performance. The FIFOs all have a data width of 66-69 bits. The FIFOs vary in depth, and
especially the input FIFO for the SI-module is deeper than the rest. An overview of the
FIFOs used and their expected memory usage is seen in Table 6-4.
Table 6-4: Expected memory bit and register usage of implemented FIFOs
FIFO component   # of    FIFO data  FIFO data  Bit usage  Memory bit  Register bit usage
name             FIFOs   width      depth      pr. FIFO   usage       (for queue)
Output_FIFO      6       67         64         4288       25728       0
PI_fastFIFO      1       69         128        8832       8832        0
PI_slowFIFO      4       67         32         2144       8576        0
SI_input_FIFO    1       66         128        8448       8448        0
SI_output_FIFO   1       66         64         4224       4224        0
Unsup_FIFO       2       66         4          264        0           528
Switch_FIFO      1       66         4          264        0           264
Total            16                                       55608       792
The expected number of memory bits exceeds the actual number of memory bits given in
Table 6-3 by just under 500 bits. This indicates that some FIFOs contain unused bits
that are synthesized away. The total memory usage is, however, in the region of what is
expected.
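The per-FIFO figures in Table 6-4 are the product of data width and depth, multiplied by the number of instances. The sketch below recomputes them from the rows of the table; the tuple layout and the block RAM flag are our own representation. Summing the rows this way gives 55808 memory bits, which is 200 more than the total printed in the table, so one of the printed values may contain a slip; the register total of 792 bits matches.

```python
# (name, instances, data width, depth, implemented in block RAM)
FIFOS = [
    ("Output_FIFO",    6, 67,  64, True),
    ("PI_fastFIFO",    1, 69, 128, True),
    ("PI_slowFIFO",    4, 67,  32, True),
    ("SI_input_FIFO",  1, 66, 128, True),
    ("SI_output_FIFO", 1, 66,  64, True),
    ("Unsup_FIFO",     2, 66,   4, False),  # built from registers
    ("Switch_FIFO",    1, 66,   4, False),  # built from registers
]

memory_bits = 0
register_bits = 0
for name, count, width, depth, in_block_ram in FIFOS:
    bits_per_fifo = width * depth
    total_bits = count * bits_per_fifo
    if in_block_ram:
        memory_bits += total_bits
    else:
        register_bits += total_bits
    print(f"{name:15s} {bits_per_fifo:5d} bits each, {total_bits:6d} bits total")

print("memory bits  :", memory_bits)
print("register bits:", register_bits)
```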
The synthesis report also shows, that each FIFO and the PCI Express module use
around 40-45 LC combinationals and 30 LC registers. This means that the FIFOs and
the PCIe-Module together will use close to 1200 LC registers.
Looking at the register use for the PI FSM control in the PI-module, the header of a
TLP is saved to registers when executing a posted read from each of the three designated
FIFOs. In addition, the header of the request being executed is saved to registers,
resulting in a total of four TLP headers being saved to registers. For this purpose
there is a need for 16 32-bit registers, resulting in 512 LC registers. In addition, the
PI FSM control saves some data and holds some counters in registers, resulting in
around 620 LC registers in total for the PI FSM control.
The LC registers of each component in each module have been analyzed in the same way.
The expected LC register usage is seen in Table 6-5.
Table 6-5: Expected LC register usage for the PCIe-Bridge design
Module       Component                        Number of LC registers expected
PI-Module    PI FSM Control                   620
             PI FSM Completion                90
             PI FSM Interface                 25
             PI Interrupt                     25
SI-Module    SI FSM Control                   300
             SI FSM input and SI FSM output   10
             SI FSM interface                 50
TLP Switch                                    150
TLP Multiplexer                               10
Control registers and GPIOs                   600
Unsupported TLP                               35
Dummy Interface                               5
Registers from FIFOs and PCIe module          1200
Expected total of LC registers                3120
The expected total number of LC registers is around 200 more than the actual number of
LC registers. This difference may arise in the synthesis process, where the synthesis
tool optimizes the design and thereby saves some LC registers.
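The total of Table 6-5 and its distance to the synthesized figure can be recomputed directly from the table rows. The dictionary below is our own transcription of Table 6-5, and the actual count is the LC register figure from Table 6-3.

```python
EXPECTED_LC_REGISTERS = {
    "PI FSM Control": 620,
    "PI FSM Completion": 90,
    "PI FSM Interface": 25,
    "PI Interrupt": 25,
    "SI FSM Control": 300,
    "SI FSM input and SI FSM output": 10,
    "SI FSM interface": 50,
    "TLP Switch": 150,
    "TLP Multiplexer": 10,
    "Control registers and GPIOs": 600,
    "Unsupported TLP": 35,
    "Dummy Interface": 5,
    "Registers from FIFOs and PCIe module": 1200,
}
ACTUAL_LC_REGISTERS = 2943  # from the synthesis report, Table 6-3

expected_total = sum(EXPECTED_LC_REGISTERS.values())
difference = expected_total - ACTUAL_LC_REGISTERS
print(expected_total, difference, f"{difference / expected_total:.1%}")
```

The estimate is within a few percent of the synthesized result, which supports reading the gap as optimization by the synthesis tool rather than as missing logic.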
Altera Quartus II does not give a detailed description of what combinational logic has
been used and where it has been used. Because of this, the number of LC combinationals
will not be investigated further.
The hardware usage of the PCIe-Bridge is as expected, and the design fits on the chosen
FPGA.
6.2.2 Meeting the timing constraints (RBS)
When synthesizing the PCIe-Bridge, the timing constraints were not met at first. After
altering the code to minimize the levels of logic in the paths that did not meet the
timing constraints, timing was still not closed. The failing paths were mainly paths
routed to and from FIFOs. At this stage all the FIFOs were implemented in block RAM, so
to reduce the time lost in routing, we instructed Quartus II to implement some of the
FIFOs in registers instead of block RAM. This makes the placing and routing of the
design more flexible, and it enabled Quartus II to close timing.
In the PI-module, some multiplexers have been inferred to handle the requests from the
four different queues. These multiplexers showed up in many of the critical paths around
the bridge. The alterations we made to shorten the critical paths were mainly around
these multiplexers. Some of the alterations involved minor changes in the data flow of
the design, and others consisted of optimizing the way the code was written. One of the
changes was to pass the chip address, which determines on which chip each request should
be executed, from the TLP-Switch to the PI. This change made it unnecessary to interpret
the chip address again, thereby saving a multiplexer.
Finally we managed to meet the timing constraints of the PCIe-Bridge on the slowest
speed grade of the Cyclone IV chip series - the EP4CGX15BF14C8 - which also helps
to reduce the component cost of the PCIe-Bridge.
7 CONCLUSION
7.1 Results
In this project we have designed and implemented a working PCIe-Bridge, as proposed by
the original project description from Vitesse. We have achieved that by seeking answers
to the five questions we asked in section 1.2, and we have used the questions to
structure our work and this report.
The functional requirements for our PCIe-Bridge were defined as a compromise between
the full set of functions Vitesse could wish for and what could realistically be
achieved by two bachelor students. From the investigation of the requirements, we found
that the PCIe-Bridge should implement all three layers of a PCI Express endpoint and
support the split transaction capability of the PCI Express interface. Furthermore, the
PCIe-Bridge should be able to interpret the packet-based requests from the PCI Express
interface, and process them through the parallel interface or the serial interface. The
implementation should be a compromise between cost and functionality.
From the identified requirements for the PCIe-Bridge, we derived that the needed FPGA
should have an IP core for PCI Express communication and enough I/O pins for a
shared parallel interface bus. The shared bus of the parallel interface is a
compromise that minimizes the number of pins used, thereby keeping the price of the
device low. The Altera Cyclone IV GX (EP4CGX15BF14C8) FPGA turned out to be the
best-suited FPGA for the job.
A good way of designing the PCIe-Bridge for implementation in the Altera Cyclone IV
is to use the PCIe IP core with Avalon-ST interface that can be instantiated in
Quartus II. The Avalon-ST interface makes use of the split transaction capability of
PCI Express, making it possible to execute multiple requests in parallel. Executing
multiple requests in parallel enables us to have posted reads in progress on all
switch chips while executing fast reads and writes, and at the same time execute
requests through the serial interface. By making use of split transactions in the
design, we obtained full utilization of the PI and SI bandwidth. We made considerable
effort to arrive at a structural design of the PCIe-Bridge that is both intuitive and
modular, and hence straightforward to implement, test and maintain.
The designed components of the PCIe-Bridge have been implemented in VHDL, and we
took our role as product developers seriously, striving for high-quality code that is
well documented and tested. We produced approximately 6800 lines of VHDL source
code, which is well suited as a base for further extensions of the PCIe-Bridge
functionality.
Schematics for a PCIe-Bridge add-on board have been made and are ready to be laid
out. The add-on board can be used with the Jaguar and Luton26 reference boards.
We have made a testbench that enables individual and independent testing and
verification of each module of the PCIe-Bridge. Each module is tested and verified by
sending TLPs that exercise the functionality implemented in that module.
The PCIe-Bridge has not been tested on chip due to lack of time.
When we look at how far we have gotten in this project, we think that we have been
successful in selecting a realistic set of requirements for our version of the
PCIe-Bridge. We believe that once the chosen FPGA becomes available, the PCIe-Bridge
add-on board is produced, and the design is tested on the board, our design will be
close to a functional product.
7.2 Perspectives
With the developed component, Vitesse customers can now manage the Vitesse switch
chip and collect statistical data from it through the widely used PCI Express
interface, at little extra cost. This gives the Vitesse switch chips an advantage
because they can now be connected to an external CPU either through a parallel
interface or through a PCI Express interface. When the add-on board is produced, it
can be sent out to customers together with the reference systems of the Jaguar and
Luton26. This enables the customers to select the type of external CPU they prefer,
making the Vitesse switch chips more attractive in the market.
If Vitesse chooses to integrate a PCI Express endpoint in their future switch chips,
they will be able to test the software for external CPU communication with this
PCIe-Bridge add-on board. Furthermore, with the PCIe-Bridge, Vitesse will be able to
get an idea of the pros and cons of interfacing to the switch chips through PCI Express.
Another perspective of the PCIe-Bridge is to use it in a backplane switch. A backplane
switch would have a single CPU controlling several expansion cards, and each
expansion card could contain a PCIe-Bridge and a number of switch chips. The setup is
shown in Figure 7-1.
[Figure: a backplane switch in which a CPU is connected by PCIe links to three expansion cards, each containing a Bridge FPGA with two PI connections.]
Figure 7-1: Block diagram of a backplane switch.
The figure only shows two switch chips on each expansion board, but the PCIe-Bridge
can handle up to three.
Because PCI Express devices are memory mapped, writing the software for such a
backplane switch would be easy, and the bandwidth of PCI Express scales easily
to fit the size of the backplane switch.
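To illustrate why memory mapping keeps such software simple, the sketch below computes
the address of a switch-chip register from a card number, a chip number and a register
number. It is written in Python purely as an illustration. The base addresses, the
size of the per-chip address window, the register width and the function name
register_address are all hypothetical; in a real system the base address of each card
is assigned by the host during PCI Express enumeration.

```python
# Hypothetical values, for illustration only.
CARD_BASES = {0: 0xD000_0000, 1: 0xD010_0000, 2: 0xD020_0000}
CHIP_WINDOW = 0x1_0000   # assumed address window per switch chip
CHIPS_PER_CARD = 3       # the PCIe-Bridge can handle up to three chips
REG_SIZE = 4             # assumed 32-bit registers


def register_address(card: int, chip: int, reg: int) -> int:
    """Return the memory-mapped address of register `reg` on a given chip."""
    if card not in CARD_BASES:
        raise ValueError(f"unknown expansion card {card}")
    if not 0 <= chip < CHIPS_PER_CARD:
        raise ValueError(f"chip {chip} out of range")
    offset = reg * REG_SIZE
    if not 0 <= offset < CHIP_WINDOW:
        raise ValueError(f"register {reg} outside the chip window")
    return CARD_BASES[card] + chip * CHIP_WINDOW + offset
```

With a mapping of this kind, adding an expansion card only adds one more base address;
the code that accesses the switch-chip registers stays the same.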
7.3 Further work
Before this product is released to customers, it would be a good idea to lay out the
designed add-on board and test the PCIe-Bridge with the Vitesse switch chip reference
systems.
During the implementation work, it became clear that two features would be useful to
have in a next release of the PCIe-Bridge:
Automatic bypass of packets in the TLP-Switch.
Reset functionality of the PCIe-Bridge through the PCI Express interface.
The automatic bypass should reject packets that are stuck in the TLP-Switch because
one of the interface modules malfunctions. This makes way for a reset packet to reach
the Control-module.
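The intended behaviour of the automatic bypass can be sketched as a small behavioural
model in Python. How a stuck packet would be detected is not specified here; the sketch
assumes a simple timeout counted in clock cycles. The class name BypassQueue, the
timeout value and the destination labels are hypothetical.

```python
from collections import deque

TIMEOUT_CYCLES = 4  # hypothetical; a real value would depend on PI/SI latencies


class BypassQueue:
    """Model of a packet queue that rejects a head packet stuck for too long."""

    def __init__(self, timeout=TIMEOUT_CYCLES):
        self.queue = deque()
        self.timeout = timeout
        self.waited = 0       # cycles the current head packet has waited
        self.rejected = []    # packets dropped by the bypass

    def push(self, dest, payload):
        self.queue.append((dest, payload))

    def tick(self, ready):
        """Advance one clock cycle.

        `ready` maps a destination name to True if that module can accept
        a packet. Returns the delivered packet, or None.
        """
        if not self.queue:
            return None
        dest, _ = self.queue[0]
        if ready.get(dest, False):
            self.waited = 0
            return self.queue.popleft()
        self.waited += 1
        if self.waited >= self.timeout:
            # Head packet is considered stuck: reject it so that the
            # packets behind it can proceed.
            self.rejected.append(self.queue.popleft())
            self.waited = 0
        return None
```

In this model, a reset packet queued behind a request to a malfunctioning module is
delivered as soon as the stuck request has been rejected.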