
Rasmus Bo Sørensen, s072080

Jaspur Højgaard, s072069

PCIe to Parallel Interface

bridge in low-cost FPGA

Vol. 1

Bachelor's Thesis, June 2010


PCIe to Parallel Interface bridge in low-cost FPGA, Vol. 1

Author(s):
Rasmus Bo Sørensen, s072080
Jaspur Højgaard, s072069

Supervisor(s):
Docent Jens Sparsø, DTU Informatics
Senior Hardware Manager Thomas K. Jørgensen, Vitesse Semiconductor Corporation

Technical University of Denmark
Informatics and Mathematical Modelling
Building 321, DK-2800 Kgs. Lyngby, Denmark
Phone: +45 45 25 33 51, Fax: +45 45 88 26 73
www.imm.dtu.dk
IMM-B.Sc.-2010-33

Release date:

25 June 2010

Class:

1 (public)

Edition:

1. edition

Comments:

This report is part of the requirements for the degree of Bachelor of Science in Engineering (BSc) at the Technical University of Denmark.

The report represents 20 ECTS points.

Rights:

© Vitesse Semiconductor Corporation, 2010

http://www.imm.dtu.dk/


ABSTRACT

The objective of this project has been to design a bridge between a PCI Express interface on one side and a parallel interface on a Vitesse switch chip on the other side. In addition, we found it appropriate to enhance the bridge with a serial interface for communication to peripheral components. The bridge should be designed for implementation in a low-cost FPGA.

We have developed a bridge for communication between a PCI Express interface on one side and a parallel interface and a serial interface on the other side. The developed bridge is capable of interfacing with 3 different parallel interfaces and 4 different serial interfaces. Given the requirements of the design, we have chosen the best suited FPGA for the implementation of the bridge, an EP4CGX15BF14C8 from Altera. The implemented design has been tested through simulation using ModelSim.

The schematic for a PCB reference board has been made. Such an interface board can

be used for testing the bridge against Vitesse switch reference systems.

By making use of the split transaction capability of the PCI Express interface, we prevent the serial and parallel interfaces from restricting each other's bandwidth.


RESUMÉ

The objective of this project has been to design a bridge between a PCI Express interface on one side and a parallel interface, on a Vitesse switch chip, on the other side. In addition, we found it appropriate to extend the bridge with a serial interface for communication with surrounding components. The bridge is to be designed for implementation in a low-cost FPGA.

We have developed a bridge for communication between a PCI Express interface on one side and a parallel interface and a serial interface on the other side. The developed bridge can communicate with 3 different parallel interfaces and 4 different serial interfaces. Given the requirements of the design, we have chosen the best FPGA for the implementation of the bridge, an EP4CGX15BF14C8 from Altera. The implemented design has been tested through simulation with ModelSim. We have also drawn the schematic for a PCB reference board. Such an interface board can be used for testing the bridge together with Vitesse switch reference systems. By making use of the split transaction capability built into the PCI Express interface, we prevent the serial and parallel interfaces from restricting each other's bandwidth.


PREFACE

This report is the result of a bachelor project at the Technical University of Denmark, carried out in cooperation with Vitesse Semiconductor Corporation. During the project, we have had supervisors from both places. We are two students who both wanted to carry out our bachelor project in cooperation with a company, to gain experience outside the university and to get a feel for a real-life engineering problem. Among the possible project descriptions we received from Vitesse, we chose this one because we found it an interesting combination of designing a digital system and fitting it to the given standards. The project description from Vitesse is shown below.

PCIe-to-parallel interface bridge in low-cost FPGA

This project is to develop a PCIe bridge for the Vitesse Ethernet Switches and MACs, so that they can be connected to CPU systems that have a PCI Express interface. The project can also include design and layout of a small interface board that holds the FPGA.

The project should contain the following activities:

• Data gathering: understanding the requirements for the PCIe bridge
• Write up a requirement specification
• Research the market for small FPGAs to find the lowest-cost device that meets the requirements
• Development of the VHDL/Verilog code for the device
• Verify the code on a testbench
• Test and debug of the system
• Write report and application note

When carrying out a commercial project whose results can be delivered to customers, the documentation is especially important. As part of this project we have written an application note based on a Vitesse template, and the application note is included in the appendix. The application note is an important part of the project because it contains details that we learned along the way and that are required to make the device work properly. Because Vitesse is an American company, the template is in letter format and may look a bit odd in A4 format.

During this project, we have had an office at Vitesse in Herlev, where we have worked most of the time. The peripheral components relevant to our project have been under development during the project, and so have their datasheets. To get the information we needed, we have spoken directly to the developers of the chips and their reference systems. During our stay at Vitesse we have participated in the weekly meetings in our department. These meetings have helped us get much of the information we needed. Speaking to the developers gave us all the information we needed, but it has complicated the documentation because of the lack of references. Much of the documentation we have is in the form of handwritten notes on post-its.

During our time at Vitesse, we have participated in the DSE fair at DTU as Vitesse representatives, and given a presentation about our project and what it was like to carry out a project at Vitesse, for students from Ingeniør Højskolen København.

The heading of each section states, by initials, which of us wrote it:

Rasmus Bo Sørensen – (RBS)

Jaspur Højgaard – (JH)

The initials in a heading apply to that section and all of its subsections.

In this report, references will be written in the following format:

[Section name, Reference number] or [Reference number].

In the submitted report, a CD with the source code, the test files, the schematics and the

report itself has been included.

It has been a great experience to work with the engineers at Vitesse. We have gotten a

lot of helpful input to our project, and therefore we would like to thank our supervisor

Thomas K. Jørgensen and the rest of the hardware group for their help.


TABLE OF CONTENTS

Vol. 1

Abstract
Resumé
Preface
List of figures
List of tables
Abbreviations
1 Introduction
1.1 Problem background
1.2 Problem statement
1.3 Target audience
1.4 The report structure
2 Interfaces of the PCIe-Bridge
2.1 PCI Express (JH)
2.1.1 PCI Express components
2.1.2 Transaction Layer Packets
2.1.3 Interrupts
2.1.4 Virtual channels
2.2 Parallel interface (RBS)
2.2.1 Posted Reads
2.2.2 Interrupts
2.3 Serial interface (JH)
2.4 General purpose input / output pins (RBS)
2.5 Requirements and usage of interfaces (RBS)
3 Selecting an FPGA
3.1 FPGA requirements (JH)
3.1.1 PCI Express Intellectual Property
3.1.2 IO Pins
3.2 Overview of possible FPGAs (JH)
3.3 About the chosen FPGA (JH)
4 Design of the PCIe-Bridge
4.1 Structural design (RBS)
4.2 Components of the structural design
4.2.1 PCIe-module (JH)
4.2.2 TLP-Switch (RBS)
4.2.3 TLP-Multiplex (RBS)
4.2.4 PI-module (RBS)
4.2.5 PI-IRQ-module (RBS)
4.2.6 SI-module (JH)
4.2.7 Control-module (RBS)
4.2.8 Unsup-module (JH)
4.3 Address space of the PCIe-Bridge (JH)
4.4 Expected performance (RBS)
5 Implementation of the PCIe-Bridge
5.1 Implementation of HDL (RBS)
5.2 Design of interface board (JH)
5.2.1 Components
5.2.2 Schematics
5.2.3 Layout
6 Testing and verification
6.1 Simulation (RBS)
6.1.1 Structure of testbench
6.1.2 Methodology of testing
6.1.3 Verifying packet routing
6.1.4 Verifying functionality of modules
6.2 Synthesis
6.2.1 Hardware usage (JH)
6.2.2 Meeting the timing constraints (RBS)
7 Conclusion
7.1 Results
7.2 Perspectives
7.3 Further work
References

Vol. 2 (Confidential)

A Application note
B PCB schematics
C Test files
D Source code
E Datasheet extracts
F PCI Express extract


LIST OF FIGURES

Figure 1-1: External CPU interfacing to Vitesse switch chip, through parallel interface.
Figure 1-2: External CPU interfacing to Vitesse switch chip, through PCI Express interface.
Figure 1-3: External CPU interfacing to multiple Vitesse switch chips, through PCI Express interface.
Figure 2-1: Overview of the interfaces on the PCIe-Bridge.
Figure 2-2: The structure of the PCIe link, between the PCIe-Bridge and the External CPU, showing the layers of the PCI Express interface.
Figure 2-3: Serial order of Transaction Layer Packets.
Figure 2-4: Timing diagram of a read access through the parallel interface.
Figure 2-5: Timing diagram of a write access through the parallel interface.
Figure 2-6: Timing of accesses when doing a posted read from a 16-bit parallel interface.
Figure 2-7: The bandwidth of a 16-bit parallel interface on the Vitesse chips, as a function of posted reads.
Figure 2-8: Timing diagram of a write sequence to the Luton26 chip, through the serial interface.
Figure 2-9: Timing diagram of a read sequence using Luton26 serial interface.
Figure 4-1: Block diagram of the structural design of the PCIe-Bridge.
Figure 4-2: Timing diagram of Avalon-ST interface.
Figure 4-3: State diagram of the TLP-Switch.
Figure 4-4: Timing diagram of altered Avalon-ST interface used in the TLP-Multiplex.
Figure 4-5: Timing diagram of altered Avalon-ST, when multiple StartOfPacket signals are high.
Figure 4-6: State diagram of state machine in the TLP-Multiplex.
Figure 4-7: The maximum bandwidth of the shared-bus when connected to 3 16-bit parallel interfaces.
Figure 4-8: Flowchart for executing requests in the control state machine of the parallel interface.
Figure 4-9: Block diagram of the PI-module.
Figure 4-10: State diagram of the PI ctrl state machine.
Figure 4-11: State diagram of the PI interface state machine.
Figure 4-12: State diagram of the PI completion state machine.
Figure 4-13: Block diagram of the SI-module.
Figure 4-14: State diagram of SI FSM Input.
Figure 4-15: State diagram for SI FSM control.
Figure 4-16: State diagram of interface state machine in the SI-module.
Figure 4-17: State diagram of output state machine in the SI-module.
Figure 4-18: State machine of the Control-module.
Figure 4-19: State diagram for unsupported TLP module.
Figure 5-1: Structural block diagram of PCIe-Bridge for testing routing of packets.
Figure 5-2: Overview of components on the PCIe-Bridge interface board.
Figure 5-3: An overview of how the components will be placed on the PCB.
Figure 6-1: Block diagram of testbench.
Figure 6-2: StartOfPacket signals of TLP-Switch, while testing routing of packets.
Figure 6-3: Timing diagram of interrupt signals, when testing PI-IRQ-module.
Figure 6-4: Measured bandwidth of the implemented PI-module, as a function of posted reads, when all 3 PIs are 16 bits wide.
Figure 7-1: Block diagram Backplane switch.


LIST OF TABLES

Table 2-1: Possible Transaction Layer Packet requests and completions.
Table 2-2: Transaction Layer Packet header for Memory requests, 32-bit address.
Table 2-3: Fields and structure of a completion TLP header.
Table 2-4: Fields and structure of a message TLP header.
Table 2-5: The signals and their description of the parallel interface of the Vitesse switch chips.
Table 2-6: Width of the address and data signals, for the MPLS, Jaguar and Luton26, Vitesse switch chips.
Table 2-7: Description of the signals of the SPI bus.
Table 2-8: Description of required general purpose input/output pins for the PCIe-Bridge.
Table 2-9: Table of requirements of the PCIe-Bridge and its interfaces.
Table 2-10: Required number of interfaces in PCIe-Bridge.
Table 3-1: List of possible IP solutions.
Table 3-2: The number of IO pins used by the different interfaces.
Table 3-3: The number of IO pins needed in the FPGA, to implement the desired interfaces.
Table 3-4: Price of possible FPGA solutions.
Table 3-5: Attributes of chosen Cyclone IV GX FPGA.
Table 4-1: The number of output ports on the TLP-Switch and a description of which requests are sent to which output port.
Table 4-2: Priorities of input ports of the TLP-Multiplex.
Table 4-3: Maximum bandwidth for SI reads and writes.
Table 4-4: Address space of the Control-module.
Table 4-5: Address space of the PCIe-Bridge.
Table 5-1: Filenames of the implemented modules.
Table 5-2: The supply voltage for the PCIe-Bridge interface board, for each of the reference boards.
Table 5-3: Needed supply voltages by components.
Table 6-1: Testable functionality of modules in the PCIe-Bridge.
Table 6-2: Results of PI-IRQ-module test.
Table 6-3: Total hardware usage of the PCIe-Bridge design on FPGA.
Table 6-4: Expected memory bit and register usage of implemented FIFOs.
Table 6-5: Expected LC register usage for the PCIe-Bridge design.


ABBREVIATIONS

CPU Central Processing Unit

DW Double Word (32 bits)

FIFO First In First Out

FPGA Field Programmable Gate Array

GPIO General Purpose Input Output

HDL Hardware Description Language

IO Input Output

IP Intellectual Property

MPLS Multiprotocol Label Switching

PCB Printed Circuit Board

PCI Peripheral Component Interconnect

PCI-SIG PCI-Special Interest Group

PCIe PCI Express

PHY Physical layer

PI Parallel Interface

QW Quadruple Word (64 bits)

SI Serial Interface

SPI Serial Peripheral Interface

TLP Transaction Layer Packet

VHDL VHSIC Hardware Description Language

VHSIC Very High Speed Integrated Circuit


1 INTRODUCTION

1.1 Problem background

The Vitesse Semiconductor Corporation makes Ethernet switch chipsets that its customers integrate into their Ethernet switch products. Customers using these Vitesse chipsets might require an external CPU system for performing management tasks, such as collection of statistical data and configuration of the switch. To meet this requirement, some of the Vitesse switch chips have a parallel interface for external CPU communication, as outlined in Figure 1-1.

Figure 1-1: External CPU interfacing to Vitesse switch chip, through parallel interface.

With the introduction of PCI Express, customers now request the ability to access Vitesse switch chips from external CPU systems by taking advantage of this flexible and widely supported standard. Adding a PCI Express interface module to the existing Vitesse switch chips is not a feasible option, and hence building a “PCI Express to Parallel Interface” bridge (PCIe-Bridge) becomes relevant. The PCIe-Bridge should not limit the operation between the External CPU and the switch chip, neither with respect to functionality nor to speed. The price of the PCIe-Bridge comes in addition to the price of the switch chip. Therefore, the component price of the PCIe-Bridge must be low relative to the price of the switch chip. A solution is to use a low-cost FPGA for the implementation of the PCIe-Bridge. This setup is outlined in Figure 1-2.

Figure 1-2: External CPU interfacing to Vitesse switch chip, through PCI Express interface.

Vitesse has two newly developed switch chips, the Luton26 and the Jaguar. The Jaguar chip is a 24x1G + 4x10G Carrier Ethernet Switch; the Luton26 is a 26x1G Ethernet Switch with 12 integrated 1000BASE-T PHYs. Both switch chips have a parallel interface for onboard external communication. They also have a serial interface for booting their internal CPU or for communication with peripheral components.

For the Jaguar, it is possible to combine two of the same switch chips into one switch with twice as many ports. The PCIe-Bridge shall support multiple interfaces of these brand new types of switches. Furthermore, the configurations with the Jaguar switch chips should support an additional MPLS module; this configuration is shown in Figure 1-3. The parallel interface on the MPLS module is very similar to the parallel interface on the Jaguar.

Figure 1-3: External CPU interfacing to multiple Vitesse switch chips, through PCI Express interface.

1.2 Problem statement

This project aims to design a PCIe-Bridge between PCI Express and the parallel interfaces featured on the Vitesse Luton26 and Jaguar switches. The PCIe-Bridge should be fast enough to ensure that it is the parallel interface that restricts the speed of communication. In some configurations, there is a need to interface to multiple parallel interfaces through the bridge. Furthermore, the bridge must be implemented in a low-cost FPGA. Since we are implementing this PCIe-Bridge anyway, it would be favorable to have some additional functionality. This additional functionality could be a serial interface and a number of general-purpose input/output pins for controlling the operation state of the switch chips.

A solution to the described problem will be found by answering the following questions:

• Which requirements are there to the PCIe-Bridge?
• Which type of FPGA meets the requirements of the PCIe-Bridge best?
• What is a good design of the PCIe-Bridge?
• How can the PCIe-Bridge be implemented?
• How can the implementation of the PCIe-Bridge be tested and verified?

To state the requirements of the bridge, we will investigate the protocols of the different interfaces mentioned in the problem statement. Based on the stated requirements and the knowledge obtained from this investigation, we will find the lowest-cost FPGA that meets the requirements. With the physical restrictions at hand, we will examine how the functional design of the bridge can meet the requirements. Knowing the design and the type of FPGA, we can then determine how to implement the bridge in a real-life configuration. With the implementation and the design in mind, the bridge can be tested and verified.

1.3 Target audience

The target audience for this report is:

• The auditors of our Bachelor thesis.
• Developers at Vitesse who want to use and maintain our PCIe-Bridge design to connect an external CPU to a Vitesse switch chip through PCI Express.
• Other developers interested in designing FPGA systems connected through PCI Express to other digital components.
• Customers who want to integrate the PCIe-Bridge should read the application note.


1.4 The report structure

The main part of this report is divided into the following 5 chapters:

Interfaces of the PCIe-Bridge – Here we explain the basic theory behind the interfaces used in this project, and how they can be combined. We also look at the different requirements of the interfaces.

Selecting an FPGA – We list the arguments for the different types of FPGA to use in this project and how they will affect the design of the PCIe-Bridge. On this basis we choose which type of FPGA will suit this project the best.

Design of the PCIe-Bridge – In this chapter, we discuss the design considerations for each of the modules in the PCIe-Bridge, and describe the design we are implementing.

Implementation of the PCIe-Bridge – We explain the process of implementing the PCIe-Bridge.

Testing and verification – We look at how the PCIe-Bridge can be tested, and we seek to verify that the implementation is working properly.


2 INTERFACES OF THE PCIE-BRIDGE

To understand the requirements of the PCIe-Bridge, there is a need to understand the different interfaces that the PCIe-Bridge will connect to. In this chapter we provide an overview of the interface protocols used in the PCIe-Bridge. An overview of the interfaces can be seen in Figure 2-1.

The interfaces are the following:

• PCI Express, for communication with an external CPU.
• Parallel Interface, for communication with the Vitesse switch chips.
• Serial Interface, for communication with the Vitesse switch chips.
• General Purpose Input/Output pins, for controlling the operation state of the Vitesse switch chips.

Figure 2-1: Overview of the interfaces on the PCIe-Bridge.

2.1 PCI Express (JH)

The Peripheral Component Interconnect Express bus, abbreviated PCIe, was introduced
in 2004. It has replaced the older PCI bus and other internal chip interconnects. The
PCIe protocol is structured around serial component-to-component Links. A PCIe Link
consists of differentially driven signal pairs, divided into transmit pairs and receive
pairs. One such set of signals, with one pair for transmit (Tx in Figure 2-2) and one
pair for receive (Rx in Figure 2-2), is called a PCIe lane (Appendix F.1). To increase
the bandwidth, a PCIe Link can consist of 1, 2, 4, 8, 16 or 32 lanes. For the first
generation of PCIe technology, the raw signaling rate is 2.5 Gigabits/second/lane/direction.
Because first generation PCIe utilizes 8b/10b encoding, the nominal data bandwidth in
each direction per lane is 250 MB/s. Taking overhead into account, 200 MB/s are usable
by components for data transfer [PCI Express Architecture, 1].
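The bandwidth figures above follow from simple arithmetic, which can be sketched as below. The constant names are illustrative, and the usable fraction of 200/250 is only the ratio implied by the two numbers quoted in the text, not a value from the PCI Express specification.

```python
# Link parameters for first-generation PCIe, taken from the paragraph above.
RAW_RATE_BITS_PER_S = 2.5e9    # raw signaling rate per lane and direction
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b: 8 data bits carried in 10 line bits
USABLE_FRACTION = 200 / 250    # assumption: overhead ratio implied by the text


def lane_bandwidth_mb_per_s(lanes=1):
    """Nominal data bandwidth per direction in MB/s (1 MB = 10**6 bytes)."""
    data_bits_per_s = RAW_RATE_BITS_PER_S * ENCODING_EFFICIENCY * lanes
    return data_bits_per_s / 8 / 1e6


nominal = lane_bandwidth_mb_per_s(1)   # about 250 MB/s for one lane
usable = nominal * USABLE_FRACTION     # about 200 MB/s after overhead
```

The same function gives the nominal bandwidth of wider Links by passing the number of lanes.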

The PCIe architecture consists of the three discrete logical layers: The Transaction

Layer, the Data-Link Layer and the Physical Layer (PHY).

Transaction Layer: Is the top-level layer. It is responsible for assembling and dis-

assembling packets, sent or received over the PCIe Link.

Data-Link-Layer: Is the middle layer. It functions as a stage between the Transac-

tion layer and the Physical layer. This includes Link management and data integrity,

with error detection and correction.

Physical Layer: Is the lowest level layer. It includes the circuitry for interface op-

eration. This includes: driver and input buffer, parallel-to-serial and serial-to-

parallel conversion, one or multiple PLLs and impedance matching circuitry.

Communication over a PCIe Link, in this case between an external CPU and the
PCIe-Bridge, can be seen in Figure 2-2.

[Figure 2-2 diagram: on each side of the Link (External CPU system and PCIe-Bridge), a Transaction Layer, a Data-Link Layer and a Physical Layer are shown for each traffic direction, with an application layer (App. layer) on top in the PCIe-Bridge; the Physical Layers are connected through Tx and Rx pairs.]

Figure 2-2: The structure of the PCIe Link between the PCIe-Bridge and the External
CPU, showing the layers of the PCI Express interface.

The three layers are split into two halves, one handling incoming data traffic and one

handling outgoing data traffic. On top of the Transaction Layer will be an application

layer, implemented in the PCIe-Bridge.



2.1.1 PCI Express components

Components communicating through a PCIe Link can function either as a root port or
as an endpoint device. The root port device is connected to the root complex of the PCI
Express hierarchy. The root port maps a portion of a PCI Express hierarchy to endpoints,
PCI Express-PCI bridges or PCI Express fabric switches (Appendix F.1). A PCI Express
endpoint can be a requester, a completer or both, either on its own account or on
behalf of another non-PCIe component. The endpoint function is what is needed for the
PCIe-Bridge. Therefore the root port will not be investigated further.

A PCIe endpoint is identified by the four values:

Vendor ID: An ID that is unique for each vendor and is assigned by the PCI Special
Interest Group (PCI-SIG). This is reserved for the manufacturer of the PCIe component.

Device ID: Is a unique ID given to the PCIe component by the vendor.

Revision ID: Is a value stating the revision of the device.

Class Code: Is a 24-bit value specifying what type of device it is.

When an endpoint is connected to a root port, the Base Address Registers (BARs) specify
the address space needed for the specific PCIe component. The values are found in the
PCIe endpoint configuration space registers [2].
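As a minimal sketch of how the four identification values are laid out, the snippet below reads them from the first bytes of a configuration space header. The offsets are those of the standard PCI configuration header; the function name and the example ID values are hypothetical and chosen only for illustration.

```python
import struct


def parse_endpoint_ids(config_space):
    """Extract the four identification values from a PCI configuration header."""
    # Vendor ID at offset 0x00 and Device ID at 0x02, both 16-bit little-endian.
    vendor_id, device_id = struct.unpack_from("<HH", config_space, 0x00)
    revision_id = config_space[0x08]
    # The 24-bit Class Code occupies offsets 0x09..0x0B (little-endian).
    class_code = int.from_bytes(config_space[0x09:0x0C], "little")
    return {
        "vendor_id": vendor_id,
        "device_id": device_id,
        "revision_id": revision_id,
        "class_code": class_code,
    }


# Hypothetical header contents; these IDs are made-up illustration values.
example = bytearray(16)
struct.pack_into("<HH", example, 0x00, 0x1234, 0xABCD)
example[0x08] = 0x01
example[0x09:0x0C] = (0x068000).to_bytes(3, "little")
ids = parse_endpoint_ids(bytes(example))
```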

2.1.2 Transaction Layer Packets

The PCI Express interface is packet based; the packets are called Transaction Layer
Packets (TLPs). A TLP consists of a header, a data payload, if applicable, and an
optional TLP digest. For more information on the TLP digest see Appendix F.3. The
packets are sent serially over the PCIe link, in the order shown in Figure 2-3.

[Figure 2-3 diagram: TLP Header | Data Payload (if applicable) | TLP digest (optional)]

Figure 2-3: Serial order of Transaction Layer Packets.

The Data Payload does not have a fixed size. The amount of data to be written or read is
set in the header of the TLP. A request that writes or reads more than 32 bits is
denoted a burst request. The PCIe-Bridge does not support the TLP digest, and it will
not be explored any further.

The Transaction Layer provides four address spaces: the three PCI address spaces
(Memory, I/O and Configuration) and a fourth Message space [Transaction layer, 1].
Read and write requests are possible to each of the PCI address spaces. A list of
possible TLPs is found in Table 2-1.



Table 2-1: Possible Transaction Layer Packet requests and completions.

Type of TLP                     Functional description
Memory read/write               Reads/writes data from/to memory
I/O read/write                  Reads/writes data from/to I/O
Configuration read/write        Reads/writes data from/to Config. Reg.
Message                         Sends message. Is also used for interrupt
Completion with/without data    Sends completion to a request

There are two main categories of TLPs: requests and completions. Memory write requests
and Message requests are posted, meaning that they do not require a completion; all
other requests do. For a read request, the data read will be returned as data payload
in the completion TLP.

The format of the header of the TLPs varies with type of request. A header for a Memo-

ry request is seen in Table 2-2.

Table 2-2: Transaction Layer Packet header for Memory requests, 32-bit address.

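Fields are listed from bit 7 of byte +0 to bit 0 of byte +3, with the field width in bits in parentheses.

Bytes 0-3:    R (1) | Fmt (2) | Type (5) | R (1) | TC (3) | R (4) | TD (1) | EP (1) | Attr (2) | R (2) | Length (10)
Bytes 4-7:    Requester ID (16) | Tag (8) | Last BE (4) | First BE (4)
Bytes 8-11:   Addr[31:2] (30) | R (2)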

In this case the address has a length of 32 bits. Many newer CPU systems use 64-bit
addressing. With 64-bit addressing, bytes 8-11 of the header hold Addr[63:32], and the
contents of bytes 8-11 in Table 2-2 (Addr[31:2] and the reserved bits) are placed in
bytes 12-15 instead.

The header of a request has a length of either 3 or 4 Double Words (DW), where a DW
is 32 bits. The header consists of different fields. The R fields are reserved and set
low. The other fields in Table 2-2 and their functions are listed below.

Addr: The address field contains the starting address of the request. The address
must be DW aligned, so the last two bits are reserved.

Fmt: This field specifies whether the request is with or without data, and whether
a 32-bit or a 64-bit address is used.

Type: This value determines what type the request is.

TC: Is the traffic class field. Sets what service class the request is in (Appendix

F.3). It ultimately determines the relative priority of the PCI Express transaction [3].

TD: Bit asserted if there is a TLP digest. For the PCIe-Bridge, this bit must be low.

If a TLP with a TLP digest is received, the PCIe-Bridge will malfunction.



EP: The EP bit is set if the Transaction Layer has detected an error in the TLP. This

is a non-fatal error [Error Handling, 4], and therefore this bit will not be taken into

account in the PCIe-Bridge.

Attr: Is used to provide additional information that allows modification of the de-

fault handling of Transactions (Appendix F.3).

Length: The Length value is the number of DWs to be read or written. This value is
used in burst-read and burst-write requests.

Requester ID: Every device, communicating through PCI Express, is given a de-

vice ID. This device ID is static and is set to the Requester ID value.

Tag: The tag is given to each transaction so that Requester ID and tag form a

unique Transaction ID.

Last BE and First BE: Byte enables that are used to qualify bytes of interest in the

first and last DW transferred. This allows offsetting the address from the DW boun-

daries, and also allows transfers smaller than one DW [3].

The Length field is 10 bits wide, which means that the payload of a TLP can be up to
1024 DW (a Length value of 0 encodes 1024 DW). The maximum payload can be restricted
in a specific endpoint device.
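The header layout of Table 2-2 can be illustrated with a short sketch that packs and unpacks a 3-DW memory request header. This is not the implementation used in the PCIe-Bridge; the function names are hypothetical, and the Fmt and Type encodings (Fmt 00 without data, Fmt 10 with data, Type 00000 for memory requests) are taken from the PCI Express base specification rather than from the text above.

```python
import struct

FMT_3DW_NO_DATA = 0b00     # memory read, 32-bit address
FMT_3DW_WITH_DATA = 0b10   # memory write, 32-bit address
TYPE_MEMORY = 0b00000


def pack_mem_header(write, length_dw, requester_id, tag,
                    first_be, last_be, addr, tc=0):
    """Build a 3-DW memory request header (32-bit address) as 12 bytes."""
    if not 1 <= length_dw <= 1024:
        raise ValueError("length must be 1..1024 DW")
    if addr & 0x3:
        raise ValueError("address must be DW aligned")
    fmt = FMT_3DW_WITH_DATA if write else FMT_3DW_NO_DATA
    # Reserved bits, TD, EP and Attr are left at 0; 1024 DW is encoded as 0.
    dw0 = (fmt << 29) | (TYPE_MEMORY << 24) | (tc << 20) | (length_dw & 0x3FF)
    dw1 = (requester_id << 16) | (tag << 8) | (last_be << 4) | first_be
    dw2 = addr & 0xFFFFFFFC
    return struct.pack(">III", dw0, dw1, dw2)   # byte +0 is sent first


def unpack_mem_header(header):
    """Split a 12-byte memory request header into its fields."""
    dw0, dw1, dw2 = struct.unpack(">III", header)
    return {
        "fmt": (dw0 >> 29) & 0x3,
        "type": (dw0 >> 24) & 0x1F,
        "tc": (dw0 >> 20) & 0x7,
        "td": (dw0 >> 15) & 0x1,
        "ep": (dw0 >> 14) & 0x1,
        "length_dw": (dw0 & 0x3FF) or 1024,
        "requester_id": dw1 >> 16,
        "tag": (dw1 >> 8) & 0xFF,
        "last_be": (dw1 >> 4) & 0xF,
        "first_be": dw1 & 0xF,
        "addr": dw2 & 0xFFFFFFFC,
    }


# A single-DW write: Last BE must be 0000 when only one DW is transferred.
header = pack_mem_header(True, 1, 0x0100, 0x05, 0xF, 0x0, 0x10000004)
fields = unpack_mem_header(header)
```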

PCI Express supports split transactions. As mentioned for the Tag, the Requester ID and
Tag fields form a Transaction ID. This Transaction ID is reserved as long as the
completion has not been returned. This means that as long as there is a unique
Transaction ID available, a request can be sent. Each PCIe component specifies how many
Tags are supported, and thereby what the maximum number of outstanding requests is.

To identify which request a completion belongs to, the Transaction ID is included in
the Completion TLP header as seen in Table 2-3.

Table 2-3: Fields and structure of a completion TLP header.

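Fields are listed from bit 7 of byte +0 to bit 0 of byte +3, with the field width in bits in parentheses.

Bytes 0-3:    R (1) | Fmt (2) | Type (5) | R (1) | TC (3) | R (4) | TD (1) | EP (1) | Attr (2) | R (2) | Length (10)
Bytes 4-7:    Completer ID (16) | Completion status (3) | BCM (1) | Byte count (12)
Bytes 8-11:   Requester ID (16) | Tag (8) | R (1) | Lower address (7)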

A Completion header is always 3 DW, unlike the Memory request headers, which can be
3 or 4 DW. A completion header contains some fields that are not present in the Memory
request header. The new fields and their functions are described below.

Completer ID: The Completer ID is composed of three separate fields: Bus
number, Device number and Function number. It is unique for every PCI Express
function and may well change during runtime.



Completion status: Value telling the status of a request i.e. if request was success-

ful or not.

BCM: Is a field for the PCI-X standard and must not be set by PCI Express com-

pleters.

Byte count: If the completion to a read request is split into multiple completion
packets, this field shows how many bytes remain to be read (Appendix F.3). For
completions to requests other than memory reads, the byte count is always set to 4.

Lower address: This 7-bit field contains the least significant bits of the byte address of the first byte returned; bits 2 to 6 are taken from the Addr field of the request.

If there is any data to return, it will be sent directly after the header.

2.1.3 Interrupts

Interrupts over a PCIe Link can be handled in different ways. One way is the INTx
virtual wire signaling mechanism, which uses message TLPs to signal an interrupt.
The header of a message TLP is seen in Table 2-4.

Table 2-4: Fields and structure of a message TLP header.

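Fields are listed from bit 7 of byte +0 to bit 0 of byte +3, with the field width in bits in parentheses.

Bytes 0-3:     R (1) | Fmt (2) | Type (5) | R (1) | TC (3) | R (4) | TD (1) | EP (1) | Attr (2) | R (2) | Length (10)
Bytes 4-7:     Requester ID (16) | Tag (8) | Message Code (8)
Bytes 8-15:    R (64)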

The header looks much like that of a Memory request TLP. The Message Code field is
the only new field and is used to indicate what kind of message the TLP is. The INTx
mechanism has eight distinct messages, split into two categories: Assert and Deassert.
INTx implements four virtual wires for each endpoint, each with an associated Assert
and Deassert message.
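The eight INTx messages can be sketched as an enumeration. The numeric Message Code values below are taken from the PCI Express base specification and are not stated in the text above; the helper function is a hypothetical illustration.

```python
from enum import IntEnum


class IntxMessage(IntEnum):
    """Message Code values of the eight INTx virtual wire messages."""
    ASSERT_INTA = 0x20
    ASSERT_INTB = 0x21
    ASSERT_INTC = 0x22
    ASSERT_INTD = 0x23
    DEASSERT_INTA = 0x24
    DEASSERT_INTB = 0x25
    DEASSERT_INTC = 0x26
    DEASSERT_INTD = 0x27


def message_code(wire, asserted):
    """Return the message for virtual wire 'A'..'D' and the wanted state."""
    prefix = "ASSERT_INT" if asserted else "DEASSERT_INT"
    return IntxMessage[prefix + wire.upper()]
```

An endpoint signals an interrupt by sending the Assert message for a wire and clears it again by sending the matching Deassert message.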

2.1.4 Virtual channels

Virtual Channels (VC) can be used together with the traffic class (TC) field to route and
prioritize packets when they are sent over the PCI Express fabric. A PCI Express Link may
consist of multiple virtual channels. Devices may support up to eight virtual channels,
and they are weighted when routed through the PCI Express fabric such that virtual channel
number 0 (VC0) gets the best service and VC7 the least.

Traffic classes are mapped to virtual channels such that multiple TCs can be mapped to
each VC, while one TC cannot be mapped to multiple VCs. VC0 and TC0 (TC = “000”) are
always mapped together. The TC1 class can be mapped to either VC1 or VC0, TC2 can only
be mapped to VC2, VC1 or VC0, and so on. For a detailed description of virtual channels,
see Appendix F.4.



2.2 Parallel interface (RBS)

The parallel interface (PI) on the Vitesse switch chip can operate in master mode or in
slave mode. In master mode, the switch chip can access peripheral components through
the PI. In slave mode, an external device, e.g. a CPU, can access the internal registers
in the switch chip. When the external CPU is connected to the PI of one or multiple
Vitesse switch chips through the PCIe-Bridge, the PCIe-Bridge will always be master of
the PI. The PI signals and timing diagrams of the Luton28 are found on pages 226-228
of Appendix E.1 and are repeated in Table 2-5, Figure 2-4 and Figure 2-5. The PI signals
and timing diagrams of the Luton28 are similar to those of the Jaguar and the Luton26.
The reference is to the Luton28 datasheet because the datasheets of the Jaguar and
Luton26 are not finished.

Table 2-5: The signals and their description of the parallel interface of the Vitesse

switch chips.

Signal        Type    Description
PI_Addr[ ]    I       Address of the register
PI_Data[ ]    I/O     The data to be written or read
PI_RnW        I       Read/write control pin
PI_nOE        I       Enables the data output of the PI
PI_nCS        I       Chip select, for selecting the specific chip
PI_nDone      O       Acknowledge control pin
PI_IRQ[ ]     O       Interrupt signal

The supported types of switch chips have different widths of the signals PI_Addr and

PI_Data. The width of these signals can be seen in Table 2-6 [5].

Table 2-6: Width of the address and data signals, for the MPLS, Jaguar and Luton26,

Vitesse switch chips.

Chip type        Signal     Width [# bits]
Jaguar / MPLS    PI_Addr    24
Jaguar / MPLS    PI_Data    16
Luton26          PI_Addr    4
Luton26          PI_Data    8

The parallel interface of the MPLS module is similar to the parallel interface of the
Jaguar. The way of interfacing to the MPLS module and the Jaguar switch chip is
identical; only the addresses of the registers vary.

The internal registers in the MPLS, Jaguar and Luton26 chips are 32 bits wide. Reading
a 32-bit register through the 16-bit data interface requires 2 consecutive reads with
incrementing addresses, and reading it through the 8-bit data interface requires 4
consecutive reads with incrementing addresses.
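The splitting of one 32-bit register access into several narrower PI accesses can be sketched as follows. Two points are assumptions made only for this illustration and are not confirmed by the text: that the address increments by one per access, and that the lowest address holds the most significant part of the register.

```python
REGISTER_BITS = 32


def split_accesses(base_addr, data_width_bits):
    """Addresses of the consecutive PI accesses needed for one register."""
    count = REGISTER_BITS // data_width_bits      # 2 for 16-bit, 4 for 8-bit
    # Assumption: the address increments by one for each access.
    return [base_addr + i for i in range(count)]


def assemble_register(parts, data_width_bits):
    """Combine the partial reads into one 32-bit value."""
    mask = (1 << data_width_bits) - 1
    value = 0
    # Assumption: the first (lowest-address) part is the most significant.
    for part in parts:
        value = (value << data_width_bits) | (part & mask)
    return value
```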



The timing diagram of a read through the PI is seen in Figure 2-4.

[Figure 2-4 timing diagram: waveforms of PI_ADDR (Addr), PI_Data (Data), PI_nWR, PI_nOE, PI_nCS and PI_nDone during a read access.]

Figure 2-4: Timing diagram of a read access through the parallel interface.

The address and the output enable signals of the PI have a setup time of 4 ns. After
the setup time, the chip-select signal can be set low. When the data of the read access
is valid, the nDone signal goes low and stays low until the chip-select goes high again.

The timing diagram of a write through the PI is seen in Figure 2-5.

[Figure 2-5 timing diagram: waveforms of PI_ADDR (Addr), PI_Data (Data), PI_nWR, PI_nOE, PI_nCS and PI_nDone during a write access.]

Figure 2-5: Timing diagram of a write access through the parallel interface.

The address and data signals of the PI have a setup time of 4 ns. After the setup time,
the chip-select signal can be set low. The nDone signal goes low when the data is
written. For the precise timing specifications, see Appendix E.2.

2.2.1 Posted Reads

The internal registers of the two switch chips and the MPLS co-processor consist of
normal registers and fast registers, which have different access times. Because the
datasheets of the 3 chips are not finished yet, we do not have the addresses of the
different registers. The access time of a normal register is 470 ns, and the access
time of a fast register is up to 65 ns. This difference in access time has no effect
when registers are written to; all writes take up to 65 ns. To increase the bandwidth
when reading from normal registers, the execution of such a read can be split into
three operations, a so-called posted read. The timing of the posted read can be seen
in Figure 2-6.



The three operations of a posted read are:

A single read from the normal register.

A wait of minimum 470 ns.

A number of consecutive reads from the fast register.

A posted read is initiated by reading from the address of the normal register. The data
in the normal register is then transferred to a predefined fast register, in the Vitesse
documentation referred to as the SLOWDATA register. After 470 ns the fast register can
be read like any other fast register (Appendix E.1). There can only be one posted read
in progress at a time in each switch chip.

[Figure 2-6 timing diagram: Read from Addr (65 ns), followed by a 470 ns wait during which accesses to other registers can take place, followed by Read from SLOWDATA (2*65 ns = 130 ns).]

Figure 2-6: Timing of accesses when doing a posted read from a 16-bit parallel
interface.

With the ability to split a read from a normal register into separate operations, we get
the opportunity to execute reads from fast registers and writes during the access time
of a normal register. This increases the maximum bandwidth, depending on the proportion
of reads from normal registers. The maximum bandwidth of the PI, with and without the
use of posted reads, can be seen in Figure 2-7. The maximum bandwidth is calculated by
dividing the number of bytes by the time it takes to execute the respective proportion
of posted reads and non-posted reads, when the non-posted reads are executed in the
waiting time as far as possible.

Figure 2-7: The bandwidth of a 16-bit parallel interface on the Vitesse chips, as a func-

tion of posted reads.
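The effect of filling the waiting time can be illustrated with a rough calculation for a single posted read on a 16-bit interface. This sketch does not reproduce the curve in Figure 2-7; it assumes that the initiating read takes 65 ns, that every other access takes the full 65 ns, and that only whole accesses fit into the 470 ns wait. All names are illustrative.

```python
FAST_ACCESS_NS = 65      # access time of a fast register (and of a write)
NORMAL_ACCESS_NS = 470   # access time of a normal register
DATA_WIDTH_BYTES = 2     # 16-bit parallel interface
REGISTER_BYTES = 4       # 32-bit internal registers


def posted_read_time_ns():
    """Initiating read + wait + reads from SLOWDATA, as in Figure 2-6."""
    slowdata_reads = REGISTER_BYTES // DATA_WIDTH_BYTES
    return FAST_ACCESS_NS + NORMAL_ACCESS_NS + slowdata_reads * FAST_ACCESS_NS


def fast_accesses_in_wait():
    """Whole fast accesses that fit into the 470 ns waiting time."""
    return NORMAL_ACCESS_NS // FAST_ACCESS_NS


def bandwidth_mb_per_s(total_bytes, time_ns):
    """Convert bytes per nanosecond to MB/s (1 MB = 10**6 bytes)."""
    return total_bytes / time_ns * 1e3


total_ns = posted_read_time_ns()
# Bandwidth if the interface is idle during the wait.
idle = bandwidth_mb_per_s(REGISTER_BYTES, total_ns)
# Bandwidth if the wait is filled with 16-bit accesses to other registers.
filled_bytes = REGISTER_BYTES + fast_accesses_in_wait() * DATA_WIDTH_BYTES
filled = bandwidth_mb_per_s(filled_bytes, total_ns)
```

Under these assumptions the same 665 ns window carries 18 bytes instead of 4, which is the mechanism behind the bandwidth gain described above.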



The posted reads graph in Figure 2-7 shows an increase of bandwidth for all proportions
between 0 % and 100 % posted reads; at 20 % posted reads the increase of bandwidth is
almost 100 %. It is favorable for the PCIe-Bridge to have the ability to execute posted
reads, because it increases the bandwidth to some extent no matter the proportion
between posted reads and non-posted reads.

2.2.2 Interrupts

As a part of the parallel interface there are 2 interrupt pins. These interrupt pins can
be programmed to act on different properties of the switch chip. We will not go into
detail with these properties, because the programming of them is done in software. The
properties are programmed in registers on the switch chips or the MPLS co-processor.

2.3 Serial interface (JH)

The Vitesse switch chips can be accessed through a serial interface. The serial interface

transfers data from one unit to another one bit at a time. The serial interface on the Vi-

tesse switch chip, also referred to as the SI, is a variation of the Serial Peripheral Inter-

face (SPI) bus. The SPI bus is an industry standard and a component using the SPI bus,

can operate either in master or in slave mode [6]. Because the external CPU system ac-

cesses the Vitesse switch chip and not the other way around, the PCIe-Bridge will act as

master and the switch chip as slave.

The SPI interface consists of two control signals and two data signals. The two control
signals are the serial clock and the chip select, and the data signals are data in and
data out. A complete signal overview for the SI on the Vitesse switch chip can be seen
in Table 2-7.

Table 2-7: Description of the signals of the SPI bus.

Signal    Type    Description
SI_CLK    I       Serial clock. This has a maximum value of 25 MHz
SI_DI     I       Serial data in line, for receiving data
SI_DO     O       Serial data out line, for transmitting data
SI_nCS    I       Chip select line, indicating when chip is being addressed

The serial clock has 25 MHz as maximum allowed frequency (Appendix E.2). If the
frequency is higher, there is a risk of data corruption. While the SI is idle, the serial
clock is held low, and it goes high when the first bit is ready to be sent. The chip
select is an active-low signal that indicates whether the chip is being addressed. It is
held high when idle, and when a request is initiated, the chip select is pulled low.

The SI on the Luton26 and Jaguar Vitesse chips can handle read or write requests. A
write request for Luton26 will be executed as seen in Figure 2-8.



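[Figure 2-8 timing diagram: waveforms of SI_CLK, SI_nCS, SI_DI and SI_DO during a write; the bit sequence is WR, Addr22, Addr21 ... Addr1, Addr0, followed by Data31, Data30 ... Data1, Data0; a minimum time of 40 ns is marked.]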

Figure 2-8: Timing diagram of a write sequence to the Luton26 chip, through the serial

interface.

The write sequence starts with a read/write bit, followed by a 23-bit address, followed
directly by 32 bits of data [7]. The bits are sent on the rising edge of the serial
clock.
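The write sequence described above can be sketched as a function that produces the bits in transmission order. The most-significant-bit-first order follows Figure 2-8; the polarity of the read/write bit (1 for write) is an assumption made for this illustration, and the function name is hypothetical.

```python
ADDR_BITS = 23   # Luton26 address length
DATA_BITS = 32   # register word length


def si_write_frame(addr, data, write_bit=1):
    """Return the bits of a write sequence in transmission order."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 23 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data does not fit in 32 bits")
    # Assumption: a read/write bit of 1 selects a write.
    bits = [write_bit]
    # Address bits Addr22 down to Addr0, then data bits Data31 down to Data0.
    bits += [(addr >> i) & 1 for i in range(ADDR_BITS - 1, -1, -1)]
    bits += [(data >> i) & 1 for i in range(DATA_BITS - 1, -1, -1)]
    return bits


frame = si_write_frame(0x000001, 0x80000001)
```

A complete write therefore occupies 1 + 23 + 32 = 56 clock cycles on the serial interface.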

The read sequence contains a waiting interval in addition to sending the address and
retrieving data. The access time of the registers is at most 1 µs, and therefore the SI
has to wait that long before it can be certain that the accessed data is available.
There are different methods of handling this wait interval. The one used in the
PCIe-Bridge is to hold SI_CLK high for at least 1 µs. A read request can be seen in
Figure 2-9.

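[Figure 2-9 timing diagram: waveforms of SI_CLK, SI_nCS, SI_DI and SI_DO during a read; WR, Addr22, Addr21 ... Addr1, Addr0 are sent, followed by a wait of minimum 1 µs, after which Data31, Data30 ... Data1, Data0 are returned; a minimum time of 40 ns is marked.]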

Figure 2-9: Timing diagram of a read sequence using Luton26 serial interface.
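A rough estimate of the duration of a serial access follows from the frame length and the maximum clock frequency. The sketch below assumes the clock runs at the full 25 MHz and ignores the 40 ns minimum time marked in the figures, so the results are lower bounds rather than measured values.

```python
SI_CLK_HZ = 25e6            # maximum serial clock frequency
FRAME_BITS = 1 + 23 + 32    # read/write bit + address + data
READ_WAIT_S = 1e-6          # SI_CLK is held high for at least 1 µs on reads


def write_time_s(clk_hz=SI_CLK_HZ):
    """Shortest possible duration of a write sequence."""
    return FRAME_BITS / clk_hz


def read_time_s(clk_hz=SI_CLK_HZ):
    """Shortest possible duration of a read sequence, including the wait."""
    return FRAME_BITS / clk_hz + READ_WAIT_S


def throughput_mb_per_s(access_time_s, payload_bytes=4):
    """Payload throughput in MB/s (1 MB = 10**6 bytes)."""
    return payload_bytes / access_time_s / 1e6
```

With these assumptions a write takes at least 2.24 µs and a read at least 3.24 µs, which is far slower per register than the parallel interface.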

The variation from the SPI bus lies in how a request is executed. The SPI bus supports
data being transferred in both directions at the same time. However, in the Vitesse
switch chips, data is either transmitted or received at a given time, not both. Also,
for the SPI interface there is no standard length of address or data. In addition, the
read/write bit does not have to be the first bit transmitted.

The version found on the Jaguar chips also has a word length of 32-bits but it has an

address length of 22-bits instead of the 23-bits that the Luton26 chip has. The read and

write request executions in the Jaguar chip though are equal to those in the Luton26

chip, with bit 22 in the address being a “do not care” bit.



Because the SPI interface is an industry standard, it is supported by many components,

other than the Vitesse switch chips. Its presence therefore brings many additional possi-

bilities to the PCIe-Bridge. The SPI interface will be connected to two Vitesse switch

chips, the MPLS module and an additional Flash memory. Therefore, four serial inter-

faces are required on the PCIe-Bridge.

2.4 General purpose input / output pins (RBS)

In this application, the PCIe-Bridge is master of the system and the switch chips are

slaves. The master (PCIe-Bridge) needs three general-purpose I/O pins to take control

over the slaves (switch chips). These pins can be used for many different purposes, but

in this setup, they will be used as described in Table 2-8. The I/O pins have to be con-

trollable from the external CPU.

Table 2-8: Description of required general purpose input/output pins for the PCIe-

Bridge.

Pin name                   Description
nRESET                     Resetting the switch chips
VCore_cfg1 & VCore_cfg0    Configuration of the operation mode of the switch chips



2.5 Requirements and usage of interfaces (RBS)

From what we have found out so far, we draw up a table of requirements to the different

interfaces and for the basic functionality of the PCIe-Bridge.

Table 2-9: Table of requirements of the PCIe-Bridge and its interfaces.

Requirements

The PCIe-Bridge must:
    Implement all three layers of the PCI Express interface as a PCI Express endpoint.
    Be capable of interpreting Transaction layer packets from the PCI Express interface and return completions.

The PCI Express interface must:
    Be able to handle memory requests and return an unsupported request completion if another type of request is received.
    Have capability for returning completions to all supported requests.
    Implement the virtual wires functionality of the PCI Express base specification for interrupt handling; this should be done by sending message requests.

The Parallel Interface must:
    Handle reads and writes.
    Communicate with three different chips.
    Be able to handle posted reads.
    Register events on the different interrupt pins.

The Serial Interface must:
    Handle reads and writes.
    Communicate with four different chips.

The GPIOs must:
    Be controllable by sending memory requests to the PCIe-Bridge.

The needed number of interfaces and their required capabilities in the PCIe-Bridge are
shown in Table 2-10. The data per transaction is a range for the PCI Express interface
and the parallel interface: the reads and writes of the PCI Express interface can be
burst reads and writes, and the parallel interface has different data widths.

Table 2-10: Required number of interfaces in PCIe-Bridge.

Interface      Count    Supported transaction through the interface    Data pr. transaction
PCI Express    1        Memory Read / Write, Message interrupt         1 DW to 32 DW
PI             3        Read / Write                                   1 byte to 2 bytes
SI             4        Read / Write                                   1 DW
GPIO           3        Assert / Deassert                              1 bit



Because the amount of data per transaction differs between the interfaces, the execution
of a transaction takes different amounts of time. The difference in execution time has
to be dealt with in the design of the PCIe-Bridge.


3 SELECTING AN FPGA

The PCIe-Bridge will be implemented using an FPGA. In order to find a suitable FPGA

for this purpose, a rough overview of what is required of the FPGA is necessary. When

the requirement specification for the FPGA is found, an overview of possible solutions

from different FPGA vendors will be done, in order to find the best solution according

to functionality and price.

3.1 FPGA requirements (JH)

The chosen FPGA must be able to connect to the different interfaces described in the
previous section. Therefore, the FPGA must meet the specific requirements of each
interface protocol. For the general purpose input/outputs, the parallel interface and
the serial interface, this means that enough user-controllable input/output pins on the
FPGA must be free for the interfaces to use. For the PCIe interface, though, there are
special hardware requirements for the Physical layer of the PCIe protocol stack.

3.1.1 PCI Express Intellectual Property

The three layers of the PCIe interface should be implemented in the FPGA using an
Intellectual Property (IP) core, because implementing a PCIe endpoint from scratch would
be a project in itself. The PCIe IP featured on an FPGA contains all three layers of the
PCIe interface, and the developer therefore only has to concentrate on what is going on
in the Transaction Layer.

These PCIe IPs for FPGAs come in two categories: Hard IP and Soft IP. There are three
variations of PCIe IP solutions available; they can be seen in Table 3-1.


Table 3-1: List of possible IP solutions.

Possible IP solutions        Description
Hard IP                      A hard-wired circuit implementing a PCI Express endpoint is included in the FPGA.
Soft IP with internal PHY    A hard-wired circuit implementing the transceivers and the rest of the Physical Layer is included in the FPGA. The Data link layer and the Transaction layer are implemented in the programmable FPGA logic.
Soft IP with external PHY    The Data link layer and the Transaction layer are implemented in FPGA logic and the Physical Layer is placed on an external chip.

The external chip opens up for the possibility to use the cheapest series of FPGAs, be-

cause there are no special requirements to the FPGA, other than having enough logical

elements to contain the Soft IP.

3.1.2 IO Pins

The parallel interface, the serial interface and the general purpose IOs featured in the

PCIe-Bridge need a number of controllable IOs on the FPGA. How many each interface

needs is seen in Table 3-2.

Table 3-2: The number of IO pins used by the different interfaces.

Interface             # of IO pins needed
Parallel interface    46
Serial interface      4
General purpose IO    1

From Table 2-10 we know the number of required interfaces. The three parallel inter-

faces can each have a PI bus or they can share a single bus. The four serial interfaces

can have a SPI bus each or they can share a single bus. The number of IO pins needed

on the FPGA for these two solutions is seen in Table 3-3.


Table 3-3: The number of IO pins needed in the FPGA, to implement the desired inter-

faces.

Interface             Count    No. of IO pins (Full parallel)    No. of IO pins (Different chip select)
Parallel interface    3        144                               50
Serial interface      4        16                                7
General purpose IO    3        3                                 3
Total pins                     163                               60

The aim is for the FPGA to have at least 60 IO pins available for the interfaces. It
would be an advantage if it had the 163 IOs needed to support a separate bus for each
interface.

3.2 Overview of possible FPGAs (JH)

There are three main vendors that deliver low-cost FPGAs with PCI Express IPs: Xilinx,
Lattice and Altera. The unit prices for the Altera FPGAs and the Xilinx Spartan 3 are
from Digi-key. The unit price for the Xilinx Spartan 6 is from Avnet, and the unit price
for the Lattice FPGA is from Lattice’s own webpage.

The low-cost FPGA solutions and their price can be seen in Table 3-4. These FPGAs

have been chosen so they meet the minimum requirements of needed IOs. Furthermore,

for the FPGAs that need a soft IP, there should be enough leftover logic for the imple-

mentation of the PCIe-Bridge.

Table 3-4: Price of possible FPGA solutions.

Vendor     FPGA family    Model number          Unit price [$]    External PHY cost [$]    Total cost [$]
Xilinx     Spartan 3      XC3S50A-4FTG256C      8,8               15,4                     24,2
Xilinx     Spartan 6      XC6SLX25T-2FG484C     64,3              0,0                      64,3
Lattice    ECP2M          LFE2-12E-5QN208C      28,8              0,0                      28,8
Altera     Cyclone II     EP2C5Q208C8N          13,9              15,4                     29,3
Altera     Cyclone IV     EP4CGX15BN11C8N       24                0,0                      25,2

The price of interest is the high volume price of the components. It is hard to say what

the high volume prices of the FPGAs are, so therefore we use the unit price as a refer-

ence, when choosing an FPGA.

The cheapest solution in one unit is the Altera Cyclone IV FPGA. Because the Cyclone

IV appears to be the cheapest solution, it will be the FPGA of choice. Furthermore, the


Cyclone IV series is a new family of FPGAs from Altera released in 2010, and therefore

the prices are likely to fall over time.

3.3 About the chosen FPGA (JH)

The Cyclone IV is a family of different FPGAs. Cyclone IV comes in two series, the E-

series and the GX-series. It is the GX-series that comes with the targeted PCI Express

Hard IP. Therefore the chosen FPGA is of the Cyclone IV GX series. The attributes for

the lowest cost FPGA in the Cyclone IV GX series can be seen in Table 3-5.

Table 3-5: Attributes of chosen Cyclone IV GX FPGA

FPGA chosen          Product no.    PCIe IP    No. of IOs    Logical Elements
Altera Cyclone IV    EP4CGX15       Hard IP    72            14400

These attributes fulfill the requirements for supporting all the necessary interfaces when

using chip select signals to components. The Hard IP implements an x1 lane PCI Ex-

press connection which has a bandwidth of 200 MB/s.

The Altera PCI Express module can be instantiated in the Quartus II design tool in two
different modes, with two different interfaces:

Avalon-MM: This interface executes one request at a time, through a data and
address based interface.

Avalon-ST: This interface has two streaming interfaces, a receiving interface
and a transmitting interface. Multiple requests can be executed at a time.

The split transaction functionality of the PCI Express specification is only used by the

Avalon-ST interface [Avalon-ST interface, 7].


4 DESIGN OF THE PCIE-BRIDGE

The PCIe-Bridge will be designed in this section, such that it meets the requirements

found in section 2.5. The design will seek to increase the throughput to a degree where

it is the parallel interface and the serial interface that are the restricting factors.

4.1 Structural design (RBS)

The structure of the PCIe-Bridge design depends on which interface type, described in
section 3.3, is chosen for the PCIe-module. The options are Avalon-MM and Avalon-ST.
It is required that the PCIe-Bridge is able to do posted reads through the PI and to
utilize more than one of the different interfaces at a time. This means that the
interface of the PCIe-module should support the split transaction feature of the PCI
Express interface. The Avalon-ST is the only one of the Avalon interfaces that supports
split transactions through the PCI Express interface.

The structural design of the PCIe-Bridge must accommodate the given requirements.

The main requirement is full utilization of the PI bandwidth. This suggests that there

must be a separate control unit for each interface, and the possibility of prioritizing the

completions of the different interfaces. These considerations yield a block diagram as

seen in Figure 4-1. All the blocks in the block diagram contain a control unit, for con-

trolling the execution of incoming requests.


[Figure 4-1 block diagram: the PCI Express Bridge (PCIe-Bridge) contains the PCI Express module (PCIe-module, Hard IP with Avalon-ST implementation), the TLP switch (TLP-Switch), the TLP multiplexer (TLP-Multiplex), and the modules SI Requests (SI-module), PI Requests (PI-module), I/O and Control Requests (Control-module), Unsupported Requests (Unsup-module) and PI Interrupt Requests (PI-IRQ-module).]

Figure 4-1: Block diagram of the structural design of the PCIe-Bridge.

The TLP-Switch routes the incoming Transaction Layer Packet (TLP) requests to their intended module, and the TLP-Multiplex forwards the outgoing completions, in prioritized order, to the PCIe-module. To make the components easier to test, the packets remain in the TLP format until they are processed in the modules. As a result, all the components have the same packet-based interface and can be tested individually.

Inside the modules, FIFOs store the packets in TLP format until the packets are processed. The size of these FIFOs is important for keeping the bandwidth of the PCIe-Bridge high. If a FIFO is too small, packets routed to the relevant module remain in the TLP-Switch and block packets destined for the other modules until there is room in the relevant FIFO again. A single packet can block the TLP-Switch if the FIFO in the relevant module is smaller than the packet itself. The FIFO should therefore be larger than the largest possible TLP. The size of the largest possible TLP is 19 QWs: two QWs for the header and 17 for the data. The maximum data payload in this application is 32 DWs, and if the address of the request is not QW aligned, the 32 DWs span 17 QWs. The FIFO in the SI-module is the most critical with regard to size, because the SI is the slowest interface and thereby has the lowest throughput.
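The 19 QW figure can be checked with a small calculation. The sketch below is only an illustration, under the assumption that the payload is placed on the 64-bit bus according to its byte address; the function name and the example addresses are hypothetical and not taken from the design.

```python
DW_BYTES = 4   # one double word (DW) is 32 bits
QW_BYTES = 8   # one quad word (QW) is 64 bits
HEADER_QWS = 2 # header size stated in the text


def payload_qwords(payload_dws: int, start_addr: int) -> int:
    """Number of 64-bit words touched by a payload starting at start_addr."""
    first_byte = start_addr
    last_byte = start_addr + payload_dws * DW_BYTES - 1
    return last_byte // QW_BYTES - first_byte // QW_BYTES + 1


# Hypothetical example addresses: one QW aligned, one only DW aligned.
aligned_qws = payload_qwords(32, 0x1000)
unaligned_qws = payload_qwords(32, 0x1004)
largest_tlp_qws = HEADER_QWS + unaligned_qws

print(aligned_qws, unaligned_qws, largest_tlp_qws)
```

With a 32 DW payload, the aligned case needs 16 QWs and the unaligned case 17 QWs, which together with the two header QWs gives the 19 QWs used for sizing the FIFOs.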

If multiple requests are waiting to enter the SI-module, and thereby blocking the TLP-Switch, the bandwidth of the bridge will drop to the level of the SI bandwidth for as long as it takes to process the packets waiting to enter the SI-module.



4.2 Components of the structural design

The design process of each component depicted in Figure 4-1 will be described in this

section.

4.2.1 PCIe-module (JH)

The PCIe-module implements the PCI Express link to the external CPU. This means implementing the three layers of the PCI Express protocol. As stated in section 3.1.1, this is done using the Hard IP on the chosen FPGA, instantiated with an Avalon-ST interface. The Hard IP block is instantiated and configured using the MegaWizard plug-in manager. The module is configured as an endpoint with a 27-bit address space. In addition, the registers in the configuration space, mentioned in section 2.1.1, will be set. Configuration requests to the PCIe-Bridge will then be handled by the PCIe-module.

The Hard IP on the chosen FPGA only supports maximum TLP data payloads of 128 or 256 bytes. For the PCIe-Bridge, the maximum payload will be set to 128 bytes.

Data is acquired from the Avalon-ST module through a 64-bit data bus. The procedure of sending one packet can be seen in Figure 4-2.

[Timing diagram of the Avalon-ST interface. Signals: CLK, Ready, Valid, StartOfPacket, EndOfPacket, Data (Qword 1, Qword 2, Qword 3).]

Figure 4-2: Timing diagram of Avalon-ST interface.

The order of the header and data DWs in the Avalon-ST interface can be seen in [Avalon-ST interface, 4].

The First and Last byte enable fields mentioned in section 2.1.2 will not be used in the PCIe-Bridge design. This is because the registers read or written through the parallel and the serial interface are all 32-bit registers, and the addresses of the parallel and serial interface are DW aligned. The byte offset possibility that comes with the First and Last byte enable fields will therefore not be used.



4.2.2 TLP-Switch (RBS)

The TLP-Switch uses address-based routing to route the TLPs from the PCIe-module to their intended module. For a detailed description of the address space of the PCIe-Bridge, see section 4.3. When interfacing through the 64-bit Avalon-ST interface, the address field is in the 2nd QW of the TLP. Therefore, the 1st QW of the TLP must be saved until the address from the 2nd QW is found, and then the 1st QW must be sent before the 2nd QW. For further information on the alignment of data from the 64-bit Avalon-ST interface, see [Avalon-ST interface, 4]. After the 2nd QW is sent through the switch, the potential data payload is sent. A state diagram of the switch state machine can be seen in Figure 4-3.

[State diagram. States: Init, S0, S0_wait, S1, S2, S2_wait, S2_write. The transition conditions are expressed in terms of the signals Empty, Sop, Eop, Prev_eop, Ready and Ready_out.]

Figure 4-3: State diagram of the TLP-Switch.

A short description of what happens in each state of the state machine in Figure 4-3:

Init – Waiting for input from the PCIe-module.

S0 – The first QW of the packet is saved.

S0_wait – The state machine goes to this state if there is a pause in the data stream.

S1 – The second QW of the packet is saved, and it is decided where to route the packet. The first QW of the packet is sent.

S2 – The second QW of the packet is sent.

S2_wait – The state machine goes to this state if there is a pause in the data stream.

S2_write – Another QW of the packet is sent.

Besides routing the packets according to their address, the TLP-Switch should catch unsupported TLP types and send them to the Unsup-module. The type field of the TLP can be found in the 1st QW of the header. The PCIe-Bridge only supports memory reads and writes on the incoming port. The number of output ports on the switch module and their description can be seen in Table 4-1. All the output ports use the 64-bit Avalon-ST interface.

Table 4-1: The number of output ports on the TLP-Switch and a description of which requests are sent to which output port.

Number  Name          Description
1       PI Queue 1    Normal register reads from chip 1
...     ...           ...
4       PI Queue 4    Fast reads/writes and normal writes from all chips
5       SI            Requests through the SI
6       Control       Control requests
7       Extra output  Output for future use
8       Unsupported   Unsupported requests: I/O and Config. requests

Sending the requests to different output ports that have different throughput will cause the completions of the requests to be transmitted in a different order than the requests were received. This is not a problem, because the PCI Express interface supports split transactions.

To decide whether a packet should be routed to output port 4 rather than to one of the 3 preceding output ports, the switch must know the addresses of the fast registers.

The addresses of the fast registers have not been written into the code, because they are not yet available in the datasheets. The addresses used instead are those of the Luton28 switch chip. The Luton28 is an older Vitesse switch chip with a similar parallel interface. The Luton28 addresses should be changed to the valid addresses when they become available.

A limitation of the current version of the TLP-Switch is that there is no way of bypassing a packet that blocks the TLP-Switch. The only options are to wait for the blocking packet to go away or to reset the TLP-Switch by powering down. This problem could be prevented by implementing a counter that, after a certain amount of time, discards the packet or routes it to the Unsup-module. This feature has not been implemented because of lack of time.

4.2.3 TLP-Multiplex (RBS)

The TLP-Multiplex directs the completion packets from the modules back to the PCIe-module. At this point in the design a new problem shows up: the TLP-Multiplex works like a funnel, in which the data from six 64-bit Avalon-ST interfaces must fit into a single 64-bit Avalon-ST interface. This creates the need to prioritize the packets from the different modules. The priority of a module should depend on the usage of the module. Packets from the PI-IRQ-module have the highest priority, to minimize the latency of the interrupt message. The PI-module is the most frequently used interface under normal conditions and should therefore have a high priority. The Unsup-module should naturally have the lowest priority.

Because the PCI Express endpoint implementation in the chosen FPGA only supports one virtual channel, we must settle for prioritizing the incoming packets instead of mapping them to different virtual channels. The prioritizing function is made with a priority encoder, choosing between the 6 Avalon-ST input ports of the TLP-Multiplex. To distinguish between the Avalon-ST input ports that have packets for transmission and those that do not, we have made a small alteration to the Avalon-ST interface on the TLP-Multiplex.

An Avalon-ST input port requests permission to send a packet through the TLP-Multiplex by asserting the StartOfPacket signal while the ready signal is held low by the TLP-Multiplex. When the TLP-Multiplex asserts the ready signal, the rest of the procedure of the altered Avalon-ST interface is the same as for the standard Avalon-ST interface. The timing diagram of the altered Avalon-ST interface is shown in Figure 4-4. Compare this with Figure 4-2 to see the effect of the alteration.

[Timing diagram of the altered Avalon-ST interface. Signals: CLK, Ready, Valid, StartOfPacket, EndOfPacket, Data (Qword 1, Qword 2, Qword 3).]

Figure 4-4: Timing diagram of altered Avalon-ST interface used in the TLP-Multiplex.

If multiple StartOfPacket signals go high at the same time, the priority encoder picks the one with the highest priority and directs it through to the PCIe-module. After the chosen packet is through, the ready signal goes low again. The timing diagram for the case where multiple StartOfPacket signals go high at the same time is seen in Figure 4-5.



[Timing diagram of the altered Avalon-ST interface. Signals: CLK, SOP_1, EOP_1, Ready_1, SOP_2, EOP_2, Ready_2.]

Figure 4-5: Timing diagram of altered Avalon-ST, when multiple StartOfPacket signals

are high.

Here, SOP_1 and SOP_2 are the StartOfPacket signals of input port 1 and input port 2, and EOP_1 and EOP_2 are the corresponding EndOfPacket signals. The diagram shows that input port 1 has the higher priority of the two input ports.

The state machine of the TLP-Multiplex is shown in Figure 4-6. In the idle state, the

priority encoder decides which incoming packet should be routed through. In the input

states, the packet is then routed through.
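The decision made in the idle state can be illustrated with a short software model. This is only a sketch of a fixed-priority encoder, assuming that index 0 of the request vector corresponds to the highest-priority input port; the function name and the list representation are hypothetical and not part of the hardware design.

```python
from typing import Optional, Sequence

# Port order taken from the priority table of the TLP-Multiplex (Table 4-2).
PORT_NAMES = ["PI interrupt", "PI", "SI", "Control", "Additional", "Unsupported"]


def priority_select(sop: Sequence[bool]) -> Optional[int]:
    """Return the index of the highest-priority port with StartOfPacket asserted.

    Index 0 has the highest priority. None is returned when no port requests.
    """
    for index, requesting in enumerate(sop):
        if requesting:
            return index
    return None


# Example: the PI port and the SI port request at the same time.
requests = [False, True, True, False, False, False]
winner = priority_select(requests)
print(PORT_NAMES[winner] if winner is not None else "idle")
```

In this example the PI port wins over the SI port, and the SI packet has to wait until the PI packet has been routed through.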



[State diagram. States: Idle, Input1 to Input6. From Idle, the state machine goes to Input1 when SOP = "1-----", to Input2 when SOP = "01----", to Input3 when SOP = "001---", to Input4 when SOP = "0001--", to Input5 when SOP = "00001-" and to Input6 when SOP = "000001". Each state InputN is held while EOPN = '0' and is left for Idle when EOPN = '1'.]

Figure 4-6: State diagram of state machine in the TLP-Multiplex.

The priority of the different input ports of the TLP-Multiplex is seen in Table 4-2.

Table 4-2: Priorities of input ports of the TLP-Multiplex.

Priority  Name
1         PI interrupt
2         PI
3         SI
4         Control
5         Additional
6         Unsupported

The additional input port is for future use; it could serve another type of interface. There is also an extra output port on the TLP-Switch.

With a priority encoder, there is a chance that a packet from a port with a low priority will never be sent if the other interfaces keep blocking the TLP-Multiplex. To be sure that no port is blocked continuously, a weighted round robin encoder could be implemented instead of the priority encoder.
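The difference between the two arbitration schemes can be illustrated with a small model. The sketch below implements a plain (unweighted) round robin selection, which is a simplification of the weighted round robin encoder suggested above; the function names and the list representation are hypothetical.

```python
from typing import List, Optional, Sequence


def round_robin_select(sop: Sequence[bool], last_granted: int) -> Optional[int]:
    """Return the next requesting port after last_granted, wrapping around."""
    port_count = len(sop)
    for offset in range(1, port_count + 1):
        candidate = (last_granted + offset) % port_count
        if sop[candidate]:
            return candidate
    return None


def grant_sequence(sop: Sequence[bool], rounds: int) -> List[int]:
    """Grant ports for a number of rounds while the request vector stays fixed."""
    grants: List[int] = []
    last = -1  # no port has been granted yet
    for _ in range(rounds):
        port = round_robin_select(sop, last)
        if port is None:
            break
        grants.append(port)
        last = port
    return grants


# Port 0 requests all the time, port 5 (lowest priority) also has a packet.
print(grant_sequence([True, False, False, False, False, True], 4))
```

With a fixed-priority encoder, port 5 would wait for as long as port 0 keeps requesting. In the round robin model the two ports are granted alternately, so the low-priority port cannot be starved. A weighted variant would additionally let the high-priority ports be granted more often than the others.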



4.2.4 PI-module (RBS)

The requirements for the PI-module are the ability to communicate with 3 chips, and that the PCIe-Bridge is fast enough that the parallel interface remains the restricting factor with regard to bandwidth.

On the chosen FPGA there are not enough I/O pins for 3 independent parallel interfaces. An FPGA with enough pins for three independent parallel interfaces would be too expensive for this solution to be feasible. The 3 Vitesse chips must therefore be connected to a shared bus with 3 different chip-select pins. Sharing the bus makes it impossible for the PCIe-Bridge to fully utilize all 3 parallel interfaces at once, so as a compromise the PCIe-Bridge should be fast enough to fully utilize the shared bus. The shared bus can have one posted read in progress on each of the chips connected to it. In this section we will only distinguish between posted read operations and all other operations, denoted non-posted reads. All non-posted reads have the same properties concerning access time.

The maximum performance of the shared bus depends on the percentage of posted reads and the number of posted reads in progress at a time. Figure 4-7 displays the maximum bandwidth of the shared bus with 1 and 3 posted reads in progress. The same figure displays the maximum bandwidth of 3 separate parallel interfaces, to show the cost of choosing a shared bus over 3 separate parallel interfaces. These curves have been calculated in the same way as described in section 2.2.1, here with up to 3 posted reads in progress at a time.



Figure 4-7: The maximum bandwidth of the shared-bus when connected to 3 16-bit

parallel interfaces.

Figure 4-7 shows that, at 50 % slow reads, the bandwidth of the shared bus with 3 posted reads in progress is about twice as high as with 1 posted read in progress. The calculated maximum bandwidth of the shared bus is 30.77 MB/s.

In the performance calculations behind the graphs in Figure 4-7, it is assumed that the posted reads are equally distributed across the 3 switch chips. This assumption will not hold in the real application. The maximum bandwidth decreases the more unevenly the posted reads are distributed.

Averaging over time will smooth out the distribution of the posted reads. This can be done by directing the posted reads into 3 different queues, one for each Vitesse switch chip. When executing from the 3 queues, they should have equal priority. The distribution of the non-posted reads does affect the bandwidth of the parallel interfaces connected to the shared bus, but it does not affect the bandwidth of the shared bus. Therefore, all the non-posted reads will be directed into a single queue. In a single queue the non-posted reads are executed in the same order as they were requested, which helps to keep the latency low. The queues will be implemented as FIFOs.

The PI-module has 4 inputs, one for each queue. The TLP-Switch decides which queue the incoming TLP should be directed to, thereby avoiding the need to decode the TLP more than once. Only one TLP can be completed at a time, so the PI-module has only one output.



Because the execution of a posted read is relatively slow compared to the non-posted reads, we leave out the functionality of burst posted reads to reduce complexity. Leaving out burst posted reads increases the amount of data that must be transferred through the PCIe interface. Because of the excess bandwidth of the PCI Express interface, this does not restrict the parallel interface.

The flowchart in Figure 4-8 shows the order in which requests are executed, with the possibility of 3 posted reads in progress at a time. The flowchart shows that there is no support for burst posted reads. This limitation should be handled in software, because the PCI Express standard still allows burst reads even when the target is a normal register.

[Flowchart. Nodes: Initialize chips; Any slow reads done?; Complete slow read; Queue 1 empty?; Queue 2 empty?; Queue 3 empty?; Queue 4 empty?; Slow read in progress? (one decision for each of the queues 1 to 3); Execute from Queue 1; Execute from Queue 2; Execute from Queue 3; Execute from Queue 4.]

Figure 4-8: Flowchart for executing requests in the control state machine of the parallel

interface.

The execution of a request requires 3 tasks.

Selecting the request to execute.

Reading or writing data through the PI.

Sending the completion of the request.



The selection of the request is done by implementing the flowchart in Figure 4-8. The read or write operation through the PI should proceed as described in section 2.2. The completion of the request should follow the rules found in section 2.1.2.

The three tasks of the execution of a request should run in parallel to maximize the utilization of the PI. Implementing these 3 tasks in 3 different state machines helps keep the complexity low. The structure of the FIFOs and the state machines in the PI-module is seen in Figure 4-9.

[Block diagram of the PI-module. Blocks: FIFO 1, FIFO 2 and FIFO 3 (slow reads); FIFO 4 (fast reads/writes); PI FSM Control; PI FSM Compl; PI FSM Interface; Output FIFO. Signals: TLP (inputs and output), TLP header, Addr, Data, Req, Done, new_TLP, and the PI itself.]

Figure 4-9: Block diagram of the PI-module.

The control state machine requests a completion from the completion state machine when it starts a read request on the interface state machine. The interface state machine returns data and a data valid signal to the completion state machine, and a done signal to the control state machine.

To prevent the PI-module from stalling because it cannot transfer the data of the completion to the TLP-Multiplex, a FIFO is placed on the output of the completion state machine.

When each QW of a completion is ready, the QW is loaded into the output FIFO.

Between the output FIFO and the output port there is some logic for communicating with the TLP-Multiplex. The logic asserts the StartOfPacket signal, and sends the packet when the ready signal from the TLP-Multiplex goes high.

The state diagrams of the 3 state machines in the PI-module are shown in Figure 4-10, Figure 4-11 and Figure 4-12.



[State diagram. States: Init, Header, Decision, Wait for Done, transdata, transdata1, Stall. The transition conditions are expressed in terms of the signals slw_done1 to slw_done3, empty_1 to empty_3, usedw1 to usedw3, Empty, Done, Repeat, Data_left, WR and QW_align.]

Figure 4-10: State diagram of the PI ctrl state machine.

If the control state machine executes a read, it goes to the Stall state, where it waits for the interface state machine to finish all the read operations. If the control state machine executes a write, it goes to transdata or transdata1 depending on the QW alignment. The states transdata and transdata1 supply the interface state machine with data.



[State diagram. States: Idle, Addr, Chip_sel, Wait_st, Done_st. The transition conditions are expressed in terms of the signals Req, PI_nDone and Repeat.]

Figure 4-11: State diagram of the PI interface state machine.

The interface state machine follows the specified pattern in section 2.2 for communicat-

ing with the PI.

[State diagram. States: Idle, Last_head_qw, Wait_st. The transition conditions are expressed in terms of the signals Qword_aligned, WR, Input_valid, Bytes, lengthm1, lengthm2, pi16n8 and New_TLP.]

Figure 4-12: State diagram of the PI completion state machine.

The completion state machine makes a completion header and loads the data received

from the interface state machine into the output FIFO.



4.2.5 PI-IRQ-module (RBS)

Each switch chip has 2 interrupt pins that can be programmed to interrupt on different settings. The INTx interrupt signaling of the PCI Express standard provides four virtual wires for handling interrupts. We use all four virtual wires: two wires are assigned to one of the switch chips and one wire to each of the two other switch chips.

Every time an interrupt pin changes state, the PI-IRQ-module sends a message TLP with information on which pin changed and whether it was asserted or deasserted. For an interrupt to be detected in the PI-IRQ-module, the interrupt pin must be high across a rising edge of the internal clock in the PCIe-Bridge. In the PI-IRQ-module there are two registers for each interrupt pin: one register is set high when the interrupt pin is asserted, the other is set high when the interrupt pin is deasserted. The registers are set low when the relevant message TLP is sent.
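The behaviour of the two registers for a single interrupt pin can be sketched in software. The model below is only an illustration: the class and method names are hypothetical, the pin is assumed to be sampled once per rising clock edge, and the order in which pending messages are sent (assert before deassert) is an assumption that the text does not specify.

```python
from typing import Optional


class IrqPinTracker:
    """Software model of the two pending registers of one interrupt pin."""

    def __init__(self) -> None:
        self.prev_level = False        # pin level seen at the previous clock edge
        self.assert_pending = False    # set when the pin goes high
        self.deassert_pending = False  # set when the pin goes low

    def clock(self, level: bool) -> None:
        """Sample the pin on a rising clock edge and record state changes."""
        if level and not self.prev_level:
            self.assert_pending = True
        if not level and self.prev_level:
            self.deassert_pending = True
        self.prev_level = level

    def pop_message(self) -> Optional[str]:
        """Return the next message to send and clear its register."""
        if self.assert_pending:  # assumption: assert is sent before deassert
            self.assert_pending = False
            return "assert"
        if self.deassert_pending:
            self.deassert_pending = False
            return "deassert"
        return None


tracker = IrqPinTracker()
for level in (False, True, True, False):
    tracker.clock(level)
print(tracker.pop_message(), tracker.pop_message(), tracker.pop_message())
```

In the hardware, each returned message would correspond to one message TLP on the virtual wire assigned to the pin.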

4.2.6 SI-module (JH)

The requirement for the serial interface (SI) is that it should be able to interface to four different chips. Because of the restricted number of I/O pins on the FPGA, the SI will, like the PI, not have an independent interface to each chip. Instead, each chip has its own chip-select signal, and the chips share the data-in, data-out and clock signals.

The maximum performance of the SI is reached when running continuous writes on the

SI with a serial clock frequency of 25 MHz. The wait period of a read is what makes it

slower than a write. For each transaction, four bytes are transferred over the SI. The

maximum bandwidth of the SI can be seen in Table 4-3.

Table 4-3: Maximum bandwidth for SI reads and writes

SI requests  Transaction time [ns]  Bandwidth [MB/s]
Write        2240                   1.78
Read         3240                   1.23

The SI must support burst reads and burst writes, and a write with the maximum payload of 32 DWs would take 71.68 µs to finish when running at maximum performance. A FIFO will be added to the input of the SI-module as stated in section 4.1. By doing this, the TLP is brought into the FIFO and the PCIe-Bridge can carry on with other requests while the write request is being executed.

The same problem will be experienced at the output if a maximum payload read request is sent to the SI-module. A read request of 32 DWs would take 103.68 µs to finish. If there is no FIFO at the output of the SI-module, the TLP-Multiplex would receive the completion header and then wait while the SI performed the rest of the reads. This would stall the PCIe-Bridge.
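The bandwidth and burst figures for the SI follow directly from the transaction times. The following sketch reproduces the arithmetic; the constant and function names are chosen for illustration only.

```python
BYTES_PER_TRANSACTION = 4  # four bytes are transferred per SI transaction
WRITE_NS = 2240            # transaction time of one SI write
READ_NS = 3240             # transaction time of one SI read


def bandwidth_mb_per_s(transaction_ns: float) -> float:
    """Bandwidth in MB/s when transactions run back to back."""
    return BYTES_PER_TRANSACTION / transaction_ns * 1000.0


def burst_time_us(payload_dws: int, transaction_ns: float) -> float:
    """Time in microseconds for a burst of payload_dws transactions."""
    return payload_dws * transaction_ns / 1000.0


print(f"write: {bandwidth_mb_per_s(WRITE_NS):.3f} MB/s, "
      f"32 DW burst: {burst_time_us(32, WRITE_NS):.2f} us")
print(f"read:  {bandwidth_mb_per_s(READ_NS):.3f} MB/s, "
      f"32 DW burst: {burst_time_us(32, READ_NS):.2f} us")
```

The write bandwidth evaluates to roughly 1.786 MB/s, which appears in Table 4-3 as 1.78 MB/s, and the read bandwidth to roughly 1.235 MB/s, listed as 1.23 MB/s.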

These observations indicate that an input FIFO is necessary. An output FIFO that can hold a maximum payload TLP would also solve the problem at the output of the SI-module. A second solution would be to use the byte count value of the completion TLP header, by sending the completion in multiple TLPs with the byte count indicating how many bytes remain to be read. Since the chosen FPGA has enough area available for the FIFO implementation, the output FIFO is the chosen solution. In contrast to the byte count solution, the FIFO solution only sends the completion header once.

The FIFOs have some logic connected that controls the flow into the input FIFO and out of the output FIFO. The control of the SI and the SI itself lie between these two FIFOs, as seen in Figure 4-13.

[Block diagram of the SI-module. Blocks: SI FSM Input; Input FIFO; SI FSM Control; SI FSM Interface; Output FIFO; SI FSM Output. Signals: TLP (input and output), Addr, Data, Done, TLP Done, and the SI itself.]

Figure 4-13: Block diagram of the SI-module.

With this linear model, the SI requests are executed in the order they are received. The block diagram is also structured so that the SI, which is slow relative to the PI, is isolated from the rest of the PCIe-Bridge. The requests are received and executed by the SI FSM Control, and the completion is sent to the output FIFO. When a request has been handled, a done signal is sent to the SI FSM Output, and only then is the completion TLP sent. With this setup, sending a completion only takes as many clock cycles as there are QWs in the completion TLP. The SI FSM Input, SI FSM Control and SI FSM Output state machines are clocked using the standard 125 MHz clock, while the SI FSM Interface state machine runs on a clock that is at best 25 MHz.

The logic at the input controls how the TLPs are put into the input FIFO. The two-state FSM is seen in Figure 4-14.



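[State diagram. States: sop_i, rem_i. The state machine goes from sop_i to rem_i when valid_in = '1' & sop_in = '1' & full_i = '0', and stays in sop_i when valid_in = '0' || sop_in = '0' || full_i = '1'. It goes from rem_i to sop_i when full_i = '0' & eop_in = '1', and stays in rem_i when full_i = '1' || eop_in = '0'.]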

Figure 4-14: State diagram of SI FSM Input.

If the input FIFO is not full, the input FSM waits for a StartOfPacket and a valid signal from the TLP-Switch before it starts putting data into the queue. When an EndOfPacket signal is received, the packet has been placed in the input FIFO, and the input logic starts waiting for a new StartOfPacket from the TLP-Switch.
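The two-state behaviour can be written down as a next-state function. The sketch below follows the transition conditions given for the states sop_i and rem_i in Figure 4-14; the Python function and its boolean arguments are a hypothetical software model, not the implementation of the FSM.

```python
def si_input_next_state(state: str, valid_in: bool, sop_in: bool,
                        eop_in: bool, full_i: bool) -> str:
    """Next state of the two-state SI input FSM."""
    if state == "sop_i":
        # Wait for a valid StartOfPacket while the input FIFO has room.
        if valid_in and sop_in and not full_i:
            return "rem_i"
        return "sop_i"
    if state == "rem_i":
        # Store the remainder of the packet until EndOfPacket is seen.
        if not full_i and eop_in:
            return "sop_i"
        return "rem_i"
    raise ValueError(f"unknown state: {state}")


# A three-QW packet: StartOfPacket, one middle QW, EndOfPacket.
state = "sop_i"
trace = []
for valid, sop, eop in [(True, True, False), (True, False, False), (True, False, True)]:
    state = si_input_next_state(state, valid, sop, eop, full_i=False)
    trace.append(state)
print(trace)
```

After the EndOfPacket the model is back in sop_i, ready for the next StartOfPacket from the TLP-Switch.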

As soon as the input FIFO is not empty, the SI FSM Control starts to handle the request. The state diagram can be seen in Figure 4-15.

[State diagram. States: idle, header, header_2, get_data64q, interface, check_done, read_next, write_next, done_read, done_write. The transition conditions are expressed in terms of the signals empty_i, full_o, b32_64, q_i(4), WR_s, done, check, addr0(2) and header_sent.]

Figure 4-15: State diagram for SI FSM control.



The SI FSM Control starts by fetching the TLP header from the input FIFO and saving it to registers. It is then determined whether more data must be fetched from the input FIFO, by checking whether the address is QW aligned and whether the address length is 64 bits.

After each read or write, the SI FSM Control checks, according to the Avalon-ST rules, whether the TLP request is done, whether new data has to be fetched from the input FIFO, or whether everything is ready to start the SI one more time. In contrast to the PI-module, the SI FSM Control both handles the request and sends a completion to the output FIFO. In the PI-module, completions are handled by a separate state machine. Using a separate state machine here would remove the four states read_next, write_next, done_read and done_write, but this would only save one, and at best two, clock cycles per TLP request.

The SI FSM Control runs on a 125 MHz clock, while the SI FSM Interface runs on a clock with a maximum frequency of 25 MHz. Therefore, the SI FSM Control will handle the read_next and write_next states before the next rising edge in the SI FSM Interface. This means that at best 16 ns would be saved. A single write request, which is the transaction taking the least time, takes 2240 ns, and relative to this, the extra 16 ns used by the SI FSM Control is accepted.
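The size of the possible saving can be put into perspective with a short calculation. The sketch below uses only the clock frequency and the transaction time given above; the names are chosen for illustration.

```python
CONTROL_CLOCK_MHZ = 125      # clock of the SI FSM Control
WRITE_TRANSACTION_NS = 2240  # shortest SI transaction (single write)


def cycles_to_ns(cycles: int, clock_mhz: float) -> float:
    """Duration in nanoseconds of a number of clock cycles."""
    return cycles * 1000.0 / clock_mhz


saved_ns = cycles_to_ns(2, CONTROL_CLOCK_MHZ)        # best case: two cycles
relative_saving = saved_ns / WRITE_TRANSACTION_NS    # fraction of one write

print(f"{saved_ns:.0f} ns, {relative_saving:.2%} of a single write")
```

Two clock cycles at 125 MHz correspond to 16 ns, which is well below 1 % of the 2240 ns that the shortest SI transaction takes.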

When the SI FSM control reaches the interface state, an SI read or write is started. The

SI read or write is executed using the FSM seen in Figure 4-16.

[State diagram. States: idle, send_addr, write_st, wait_state, read_st, finish. The transition conditions are expressed in terms of the signals req, WR, addr_counter, data_counter and counter.]

Figure 4-16: State diagram of interface state machine in the SI-module.



The SI interface FSM is derived from the timing diagrams of the SI seen in section 2.3. After data has been read or written, a done signal is sent back to the SI FSM Control. Because of the difference in clock frequency, the SI FSM Control waits for the SI FSM Interface to go back to the idle state before continuing from the check_done state.

When a TLP read or write request is done, the last QW of the completion TLP is put into the output FIFO, and the counter indicating how many whole TLPs are in the output FIFO is incremented by one. When this counter is greater than zero, the output logic asserts the StartOfPacket signal at the output in the sop_o state seen in Figure 4-17.

[State diagram. States: idle, sop_o, rem_o, done. The transition conditions are expressed in terms of the signals counter, ready_out and q_o(0).]

Figure 4-17: State diagram of output state machine in the SI-module.

The output FSM implements the process of sending TLPs to the TLP-Multiplex: after the StartOfPacket signal has been asserted in the sop_o state, the remainder of the packet is sent in the rem_o state. After this, the FSM goes back to check whether there is a whole TLP in the output FIFO.



4.2.7 Control-module (RBS)

The Control-module controls GPIOs and internal control signals of the PCIe-Bridge. As mentioned in section 2.4, there are 3 pins that should be controlled. Internally in the PCIe-Bridge, the completer ID signal should be passed to all modules that return completions. The completer ID can be fetched from the PCIe-module through the tl_cfg interface. This interface has an address output and a data output. The Control-module should refresh the completer ID every time the address of the tl_cfg interface reaches the address of the completer ID. The GPIOs should be controlled from the external CPU, which is done by sending memory requests. The address space of the Control-module is so small that there is no need for the Control-module to support burst requests. The address space of the Control-module is seen in Table 4-4.

Table 4-4: Address space of the Control-module.

Address  Signal
0000     EP_ID
0001     nRESET
0010     VCore_cfg0
0011     VCore_cfg1
Rest     Reserved

The Control-module has an input FIFO on the input and an output FIFO on the output. Between the two FIFOs there is a state machine for executing the requests. The state diagram of the state machine controlling the execution of requests can be seen in Figure 4-18.



[State diagram. States: Idle, Addr_st, ExecRead_st, Compl, Error. The transition conditions are expressed in terms of the signals sop, eop, usedw_in, usedw_out, Wr, addr(2), Last_eop and Prev_last_eop.]

Figure 4-18: State machine of the Control-module.

4.2.8 Unsup-module (JH)

If a TLP that is not of the type memory read or memory write is received at the TLP-Switch, the request is not supported. Such a packet is sent to the Unsup-module, where it is first put into a FIFO. Then the transaction ID is fetched from the header of the TLP, and from this a completion TLP is sent back through the TLP-Multiplex with the completion status field set to unsupported request. The state diagram for the FSM making the completion TLPs is seen in Figure 4-19.



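[State diagram. States: idle, fin_compl, fin_req. The state machine leaves idle when sop = 1 & usedw_in(1) = 1 & usedw_out < 60, and stays in idle when sop = 0 || usedw_in(1) = 0 || usedw_out ≥ 60. The remaining transitions depend on eop = 0 and eop = 1.]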

Figure 4-19: State diagram for unsupported TLP module.

The idle and fin_compl states send the completion TLP to the output FIFO. The fin_req

state then removes the rest of the request from the input FIFO.

4.3 Address space of the PCIe-Bridge (JH)

The PI of the Jaguar switch has an address length of 24 bits. The address space of the PCIe-Bridge needs to have room for three such 24-bit PI address spaces. In addition, there must be room for four SI components with a 23-bit address space each. This means that the minimum size of the address space of the PCIe-Bridge is 27 bits.

The TLP Switch uses address based routing of packets. The PCIe-Bridge endpoint must

be set to have a 27-bit address space. Table 4-5 shows how the address space is utilized

in the PCIe-Bridge application and how it is determined where the packets are routed.



Table 4-5: Address space of the PCIe-Bridge.

Addr[26:24]   Following bits          Address space for target device (chip select to device)
000                                   PI device 1 (PI_nCS0)
001                                   PI device 2 (PI_nCS_SLAVE)
010                                   PI device 3 (PI_nCS_FPGA)
011           0                       SI device 1 (SI_nCS0)
011           1                       SI device 2 (SI_nCS_SLAVE)
100           0                       SI device 3 (SI_nCS_FPGA)
100           1                       SI device 4 (SI_nCS_FLASH)
101           000000000000000000000   The Control-module
101           (all other values)      Reserved
110                                   Reserved
111                                   Reserved

The three most significant bits determine which module or component is targeted. In the

case of SI the fourth most significant bit also determines which of the SI components is

targeted.

The address space is not fully occupied and additional functionality that requires ad-

dress space is possible.
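The routing rule of Table 4-5 can be sketched in a few lines of Python. This is only an illustrative model, not the VHDL of the TLP-Switch: the function name `decode` and the returned offsets are assumptions, and the reserved part of the 101 region is not modelled. The last lines repeat the sizing argument for the 27-bit address space.

```python
# Illustrative model of the address-based routing in Table 4-5.
# Names and return format are hypothetical, not taken from the thesis code.
ADDR_BITS = 27

PI_TARGETS = {0b000: "PI_nCS0", 0b001: "PI_nCS_SLAVE", 0b010: "PI_nCS_FPGA"}
SI_TARGETS = {
    (0b011, 0): "SI_nCS0",
    (0b011, 1): "SI_nCS_SLAVE",
    (0b100, 0): "SI_nCS_FPGA",
    (0b100, 1): "SI_nCS_FLASH",
}


def decode(addr: int) -> tuple[str, int]:
    """Return (target, offset inside the target) for a 27-bit bridge address."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address outside the 27-bit space")
    top3 = addr >> 24          # Addr[26:24]
    bit23 = (addr >> 23) & 1   # fourth most significant bit, selects the SI device
    if top3 in PI_TARGETS:
        return PI_TARGETS[top3], addr & 0xFFFFFF   # 24-bit PI offset
    if (top3, bit23) in SI_TARGETS:
        return SI_TARGETS[(top3, bit23)], addr & 0x7FFFFF   # 23-bit SI offset
    if top3 == 0b101:
        # Only the lowest addresses of this region belong to the Control-module;
        # the reserved remainder is not modelled in this sketch.
        return "Control-module", addr & 0xFFFFFF
    return "Reserved", addr & 0xFFFFFF


# Sizing: three 24-bit PI spaces plus four 23-bit SI spaces.
required_addresses = 3 * 2**24 + 4 * 2**23
min_address_bits = (required_addresses - 1).bit_length()
```

With these definitions, `decode(0x3800004)` selects SI device 2 with offset 4, and `min_address_bits` evaluates to 27.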

4.4 Expected performance (RBS)

The worst case for the PCI Express bandwidth is when all requests are single-DW writes with 64-bit addressing. Such a request takes up 20 bytes and contains 4 bytes of data. In this case the efficiency of the PCI Express link is 20 %, giving a maximum data bandwidth of 40 MB/s.

To get an idea of the expected performance of the PCIe-Bridge, only the PI-module and

the SI-module should be taken into account. The other modules will be used for initia-

lizing the PCIe-Bridge and very little during runtime.

The maximum bandwidth of the PI, seen in Figure 4-7, plus the maximum bandwidth of the SI, seen in Table 4-3, is 32.55 MB/s.

Comparing the performance of the PI and SI with the bandwidth of PCI Express, we see that even if none of the requests are burst requests, the PI and SI are still the limiting factor.
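The comparison can be checked with a short calculation. The sketch below uses only the figures quoted in this section; the packet-level throughput of 200 MB/s is not stated in the text but follows from dividing the quoted 40 MB/s by the 20 % efficiency, so it should be read as an implied value.

```python
# Worked arithmetic for the worst-case comparison, using figures from the text.
PACKET_BYTES = 20         # single-DW write with 64-bit addressing
PAYLOAD_BYTES = 4         # one DW of data
WORST_CASE_MBPS = 40.0    # worst-case PCI Express data bandwidth
PI_PLUS_SI_MBPS = 32.55   # combined maximum bandwidth of PI and SI

efficiency = PAYLOAD_BYTES / PACKET_BYTES
# Implied by the two quoted figures, not quoted in the text itself.
implied_packet_mbps = WORST_CASE_MBPS / efficiency
headroom_mbps = WORST_CASE_MBPS - PI_PLUS_SI_MBPS

print(f"efficiency:           {efficiency:.0%}")
print(f"implied packet rate:  {implied_packet_mbps:.0f} MB/s")
print(f"headroom over PI+SI:  {headroom_mbps:.2f} MB/s")
```

The positive headroom is what makes the PI and SI, rather than the PCI Express link, the limiting factor even in the worst case.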


5 IMPLEMENTATION OF THE PCIE-BRIDGE

5.1 Implementation of HDL (RBS)

The components in the structural design have been simulated along with their imple-

mentation as far as possible (see section 6 for information on the simulation setup). The

first goal of the implementation was to get the routing of packets to work. To verify that

the routing of packets through the system worked as expected, we implemented a dum-

my interface for emulating the real blocks connected to the interfaces. This dummy in-

terface was built in between the TLP-switch and the multiplexer. The only functionality

of the dummy interface is to pass on the packets from the single input to the single out-

put. This setup can be seen in Figure 5-1.

[Block diagram of the PCI Express Bridge (PCIe-Bridge): the Hard IP PCI Express module (PCIe-module), the TLP switch (TLP-Switch), several Dummy Interfaces (Dummy-modules) and the TLP multiplexer (TLP-Multiplex). One connection is marked "Not connected".]

Figure 5-1: Structural block diagram of PCIe-Bridge for testing routing of packets.

With the dummy interface in place, we tested the ability of looping back the incoming

packet to the PCIe-module.

With the routing of the packets working properly, the next step is to implement each control unit. The control units are substituted into the testbench separately and tested one at a time.


We used Quartus II for implementation and ModelSim for simulation of our project. Quartus II includes an add-on called SOPC Builder. It was our intention to use SOPC Builder to make the structural implementation and then write the control modules ourselves in VHDL. It turned out that SOPC Builder is not very efficient for designs containing user-defined components. We therefore abandoned the idea of using SOPC Builder and instead used the MegaWizard in Quartus II to configure the PCIe-module and all the FIFOs used in our design. See Appendix D for the source code we have written ourselves, and Table 5-1 for the source files associated with the different modules of the design.

Table 5-1: Filenames of the implemented modules.

Modules                          Files
Top level entity                 PCIe_bridge.vhd
Top level entity of simulation   PCIe_testbench.vhd
PI-IRQ-module                    PI_interrupt.vhd
PI-module                        PI_control.vhd
                                 PI_FSM_Ctrl.vhd
                                 PI_FSM_Interface.vhd
                                 PI_FSM_Compl.vhd
                                 PI.vhd
                                 VSC74xx_PI.vhd
SI-module                        SI_module.vhd
                                 SI_control.vhd
                                 SI_interface.vhd
                                 SI_input_FIFO_logic.vhd
                                 SI_output_FIFO_logic.vhd
                                 CLK_DIVIDER.vhd
                                 SPI.vhd
                                 VSC74xx_SPI.vhd
Control-module                   control_reg.vhd
Unsup-module                     unsupported_tlp.vhd
TLP-Switch                       tlp_switch.vhd
TLP-Multiplex                    tlp_miltiplex.vhd
PCIe-module for simulation       PCIe_module_testbench.vhd

5.2 Design of interface board (JH)

In order to verify the design of the PCIe-Bridge, some real-life testing needs to be done. For the Cyclone IV GX FPGA there is no evaluation kit available that makes it possible to test the designed PCIe-Bridge against the Luton26 or the Jaguar. Therefore, a PCIe-Bridge interface board will be designed for the purpose. The interface board will be made as an add-on board to the reference boards made by Vitesse for the Luton26 and the Jaguar chips. The reference boards have connectors on which the parallel and the serial interface signals of the chips are accessible. The interface board designed for the PCIe-Bridge will sit on top of one of these connectors. The MPLS module can be connected using a cable. When the interface board sits on top of a Luton26 or Jaguar reference board, the PCI Express interface will have to connect to the external CPU using a PCI Express cable.

The steps involved in designing an interface board for the PCIe-Bridge are:

- Determining what components are necessary for the PCIe-Bridge interface board to function properly.
- Making the schematics, where the components are connected.
- Making the layout of the board.

5.2.1 Components

In order to determine what components are necessary on the interface board, we must look at what the FPGA needs in order to function properly.

The FPGA needs to be programmed on power-up. This requires either a connector to program the FPGA or a serial configuration device (EPROM). Connectors for the reference boards and one for PCI Express are also needed. In addition, the EPROM needs to be configured, and a connector is needed for this purpose.

Overall, the necessary components for a PCIe-Bridge reference board may be listed as follows:

- FPGA: The Cyclone IV GX FPGA in which the PCIe-Bridge has been implemented.
- EPROM: To program the FPGA on power-up, so that an external system does not have to stand by at every power-up.
- JTAG connector: For the possibility of programming the FPGA directly.
- EPROM programming connector: For configuration of EPROM devices.
- PCI Express connector: Connector to which the PCI Express cable can connect.
- External oscillator: A 125 MHz clock generator.
- Reference board connectors: A connector for a Jaguar reference board, another for the Luton26 reference board and a third for the MPLS FPGA module.
- Power supplies: DC/DC converters and linear regulators to generate the different supply voltages required by the FPGA.
- Supply decoupling: Capacitors to decouple and shunt electrical noise from the FPGA.

For the JTAG and Altera EPROM connectors, Vitesse have developed an expansion

board for their MPLS board that contains these connectors. This expansion board will

also be used for the PCIe-Bridge interface board to save area, so instead of a JTAG and

an EPROM programming connector, a connector for this expansion board is featured.



5.2.2 Schematics

The schematics connect the different components together. An overview of the PCIe-Bridge interface board is seen in Figure 5-2.

[Block diagram: the Cyclone IV FPGA (EP4CGX15) with the EPROM (EPCS4), the oscillator, the power supplies and decoupling capacitors, the connector for the JTAG and ByteBlaster II board, the PCI Express connector, and the connectors for the Luton26, Jaguar and MPLS boards.]

Figure 5-2: Overview of components on the PCIe-Bridge interface board

The FPGA used in the design of the PCIe-Bridge is the smallest in the Cyclone IV GX-

series. The chosen FPGA comes in two different packages. The packages are a 169-pin

Fine Ball Grid Array (FBGA) and a 144-pin Quad Flat No leads (QFN). The FBGA

package uses more area on the PCB than the QFN package, and the FBGA package is

also more expensive than the QFN package. These two factors favor the QFN package,

but it is only the smallest FPGA in the Cyclone IV GX series that comes in the 144-pin

QFN package. This removes the possibility of upgrading the number of logic cells on

the same PCB. The three smallest Cyclone IV GX FPGAs come in the 169-pin FBGA

package. Because of this, the 169-pin FBGA is the package of choice for this add-on

board.

5.2.2.1 Assigning FPGA pins

On the FPGA, the IO signal pins are split into IO banks. When assigning IO signals to pins, it is important that signals that depend on each other are assigned to pins in the same IO bank as far as possible. This makes it easier to meet the timing constraints of the design [5]. Because of this, the serial interface, the general purpose IOs and possible additions are placed in one single IO bank, and the parallel interface occupies the rest of the available IO pins, using up all the other IO banks.

How the pins are assigned according to each IO bank can be seen in Appendix B.2.

5.2.2.2 EPROM and Connectors

It now has to be determined more specifically:

- What EPROM is needed for programming the FPGA
- What connectors are needed, and where they are placed on the PCB

Altera recommend using their own EPROMs with Altera FPGAs. Since this add-on board will be used mostly for testing and will not be produced in high volume, the extra cost this may incur is accepted. The chosen EPROM is therefore the Altera EPCS4. This is the smallest EPROM that supports the chosen FPGA [9]. The Altera EPCS4 is configured through an Altera ByteBlaster II connector.

The connector for the Luton26 board consists of a 2x25-pin male pinrow, and the Jaguar connector consists of two 2x25-pin female receptacle sockets. These two connectors will be placed on the bottom side of the board. The add-on board will then be able to sit on top of the connectors on the Vitesse reference boards.

The MPLS module will be connected using a cable. The connector needed for that cable consists of two 2x25-pin male pinrows. These pinrows will be placed on the top side of the board so that they are accessible by cable. The MPLS connector will be placed on the top of the PCB where the Jaguar connector is placed on the bottom. This is done so that signals that connect to equivalent pins on the MPLS connector and the Jaguar connector can be routed straight through the board. This makes the board easier to lay out.

The expansion board featuring JTAG and Altera ByteBlaster II connectors is also connected through a 2x25-pin pinrow, which is placed on the top side of the PCB, at the same location where the Luton26 connector sits on the bottom side.

The PCI Express connector is an 18-pin PCIe x1 cable connector from Molex. It is an angled connector placed on the edge of the top side of the PCB. An overview of how the connectors and the other components are placed on the PCB can be seen in Figure 5-3.

5.2.2.3 Power Supply

The power supply for the interface board is received either through the Luton26 connector or through the Jaguar connector. The interface board supply voltage can be seen in Table 5-2.



Table 5-2: The supply voltage for the PCIe-Bridge interface board, for each of the reference boards.

Vitesse chip ref. board   Supply voltage [V]
Jaguar                    3.3
Luton26                   2.5
Luton26                   3.3

The Luton26 reference board with a supply voltage of 2.5 V has not been developed yet,

but the possibility must be considered for future applications.

There are three components requiring power supplies. An overview of what these components demand is listed in Table 5-3.

Table 5-3: Needed supply voltages by components

Component    Part of component   Required voltage [V]
EPROM        Core                3.3
Oscillator   Core                3.3
FPGA         Core                1.2
FPGA         PCIe Hard IP        2.5
FPGA         IO banks            PCIe-Bridge supply voltage

Table 5-3 shows that three voltage levels are required for the components on the PCIe-Bridge reference board. When the input supply voltage is 3.3 V, the 2.5 V supply is made using a linear regulator. A linear regulator is not an efficient way to convert between voltage levels, but since the current supplied to the PCIe Hard IP is low (Appendix B.4) and the voltage drop is relatively small, the loss in the linear regulator is acceptably low [10]. When the interface board supply voltage is 2.5 V, the linear regulator is bypassed.
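The trade-off can be illustrated with the textbook approximation that an ideal linear regulator has an efficiency of V_out / V_in and dissipates (V_in - V_out) · I_load, ignoring quiescent current. The sketch below is only an illustration of that approximation. The load current of 0.1 A is a hypothetical placeholder, since the actual currents are given in Appendix B.4 and not in this section.

```python
# Ideal linear regulator approximation (quiescent current ignored).
def linear_regulator_efficiency(v_in: float, v_out: float) -> float:
    """Fraction of the input power that reaches the load."""
    return v_out / v_in


def linear_regulator_loss_w(v_in: float, v_out: float, i_load: float) -> float:
    """Power dissipated in the regulator, in watts."""
    return (v_in - v_out) * i_load


# 3.3 V -> 2.5 V for the PCIe Hard IP, as done on the interface board.
eff_hard_ip = linear_regulator_efficiency(3.3, 2.5)
# 3.3 V -> 1.2 V, the case the board avoids by using a DC/DC converter instead.
eff_core_if_linear = linear_regulator_efficiency(3.3, 1.2)
# Hypothetical load current of 0.1 A, for illustration only.
loss_hard_ip_w = linear_regulator_loss_w(3.3, 2.5, 0.1)

print(f"3.3 V -> 2.5 V: {eff_hard_ip:.0%} efficient")
print(f"3.3 V -> 1.2 V: {eff_core_if_linear:.0%} efficient")
```

Under this approximation the small step to 2.5 V wastes roughly a quarter of the input power, while a linear step to 1.2 V would waste almost two thirds, which is consistent with using a switching converter for the core supply.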

The 1.2 V supply is generated using a DC/DC step-down converter. The DC/DC converter is used because it is more efficient than the linear regulator, and because the core of the FPGA consumes more power than the PCIe Hard IP (Appendix B.4), which increases the need for an efficient power supply.

When the supply voltage for the interface board is 2.5 V, a DC/DC step-up converter is used to supply the components requiring 3.3 V.

To ensure that electrical noise from the FPGA is not allowed to affect the whole system, decoupling capacitors on the supplies are needed. The number of capacitors needed was determined by Vitesse engineers. The decoupling and the power supplies can be seen in the schematics in Appendix B.1.

Overall, the PCIe-Bridge interface board will look something like what is depicted in Figure 5-3.

[Placement diagram showing the top and bottom sides of the PCB. Labelled parts: FPGA, EPROM, oscillator (OSC), PCIe connector, two male MPLS connectors, two female Jaguar connectors, male Luton26 connector, female expansion board connector, linear regulator, 1.2 V supply, 3.3 V supply and an area available for decoupling capacitors.]

Figure 5-3: An overview on how the components will be placed on the PCB.

The areas of the components in Figure 5-3 are drawn to scale relative to each other. The spacing between the two connectors that make up the Jaguar connector also matches that of the Jaguar reference board. The actual size of the PCB is not fixed, though; it may change during layout, where area may be saved.

The total cost for this reference board can be seen in Appendix B.3.

5.2.3 Layout

The layout for the PCIe-Bridge reference board has not been made. This task was to be done by engineers at Vitesse, but since the Cyclone IV FPGA could not be delivered in time for testing and verification, it was decided not to make the layout and order the reference board, in order to meet the deadline of the project.

A customer opting for the PCIe-Bridge interface board therefore needs to lay out the board before being able to use it.

When produced, this PCIe-Bridge interface board can be used to test and verify the im-

plemented functionality of the PCIe-Bridge.


6 TESTING AND VERIFICATION

6.1 Simulation (RBS)

As explained in section 5.1 the simulation has been done in parallel with the implemen-

tation and the testbench has been extended continuously as the need for more testing

functionality arose.

6.1.1 Structure of testbench

When the PCIe-module is instantiated in MegaWizard, a testbench is included. This testbench works by sending chaining DMA requests through the PCI Express interface. We have tested the PCIe-module with the included testbench to make sure that the PCI Express module works. There are two problems with this testbench. First, there is no documentation of how to alter the requests that are sent. Second, it is a very large testbench, which takes a long time to simulate. Instead, we have made our own testbench that does not communicate through the PCI Express interface but through the Avalon-ST interface. This simplifies the testbench and enables us to control the packets.

In the top module of the testbench, the PCIe-Bridge and the non-synthesizable models

of the parallel interface and serial interface are instantiated. Inside the PCIe-Bridge, the

auto-generated PCIe-module is replaced with a PCIe-module_testbench. The block dia-

gram of the testbench is shown in Figure 6-1.


[Block diagram: PCIe_testbench.vhd contains PCIe_Bridge.vhd, SPI.vhd with four VSC74xx_SPI.vhd models and PI.vhd with three VSC74xx_PI.vhd models. Inside PCIe_Bridge.vhd, PI_control.vhd, SI_control.vhd and PCIe_module_testbench.vhd are shown, together with an input file and an output file.]

Figure 6-1: Block diagram of testbench.

The PCIe-module_testbench has the same inputs and outputs as the PCIe-module. The only signals used by the PCIe-module_testbench are the 64-bit Avalon-ST input and output; all other signals are left open or set to zero. Inside the PCIe-module_testbench there is a file reader and a file writer. The file reader reads the input file, and each line of the input file is applied to the Rx-port for one clock cycle. The file writer writes the data from the Tx-port to a file. The structure of the packets in the test files can be seen in Appendix C.1.

6.1.2 Methodology of testing

Two different test methods are used to test the PCIe-Bridge. The first test method is for

testing the routing of packets through the PCIe-Bridge. The second test method is used

to test the functionality of each module of the PCIe-Bridge.

To test the routing of packets, the setup of the PCIe-Bridge is as seen in Figure 5-1. In

this test, TLPs are sent through the PCIe-Bridge. To verify a test of this type, the output

packets are compared to the input packets, and the route of each TLP is compared to the

expected route. In a successful test the input and output packets should be identical, and

arrive at the output in the same order as they were transmitted from the input.

When testing the functionality of each module in the PCIe-Bridge, the setup of the PCIe-Bridge should be as in Figure 4-1. To verify the test, the input packets are compared to the output packets. In a successful test, all the requests sent through the input should return a correct completion to the output. The PCI Express interface supports split transactions and each packet has its own tag; therefore, it does not matter in which order the completions are received at the output.

All test files have been listed in Appendix C together with their expected output files. With this approach, it is easier to make changes to the VHDL code and check that no unexpected behavior has been introduced to the PCIe-Bridge.
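Because completions may arrive in any order, a check of an output file against the requests has to match packets by tag rather than by position. The Python sketch below illustrates this idea only. The dictionary fields `tag`, `expected_data` and `data` are hypothetical and do not reflect the file format of Appendix C.1, and the sketch assumes that every listed request expects exactly one completion.

```python
# Order-independent comparison of completions against requests, keyed by tag.
# Field names are hypothetical; they are not the thesis test file format.
def check_completions(requests, completions):
    """Return a list of problems; an empty list means the test passed."""
    expected = {req["tag"]: req["expected_data"] for req in requests}
    problems = []
    seen = set()
    for cpl in completions:
        tag = cpl["tag"]
        if tag not in expected:
            problems.append(f"unexpected completion for tag {tag}")
        elif tag in seen:
            problems.append(f"duplicate completion for tag {tag}")
        elif cpl["data"] != expected[tag]:
            problems.append(f"wrong data for tag {tag}")
        seen.add(tag)
    for tag in expected:
        if tag not in seen:
            problems.append(f"missing completion for tag {tag}")
    return problems


requests = [
    {"tag": 1, "expected_data": 0x10},
    {"tag": 2, "expected_data": 0x20},
]
# Completions arrive in the opposite order of the requests.
completions = [{"tag": 2, "data": 0x20}, {"tag": 1, "data": 0x10}]
result = check_completions(requests, completions)
```

Here `result` is an empty list even though the completions arrive out of order; dropping one completion would produce a "missing completion" entry instead.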

6.1.3 Verifying packet routing

It is only possible to test 6 of the output ports of the TLP-Switch because the TLP-

Multiplex only has 6 input ports. To verify that the routing of packets is done in the cor-

rect way, we use the testing method described in section 6.1.2. Furthermore, the routing

should be tested under maximum load, without breaks between the packets. The binary

files from the performed tests can be seen in Appendix C.2-C.3. The result showed that

the input and output packets were identical and arrived at the output in the same order as

they were transmitted from the input. The packets should be directed out of the TLP-

Switch ports 1 – 6, in incrementing order. A wave diagram of the StartOfPacket signals

of the TLP-Switch output ports, can be seen in Figure 6-2.

Figure 6-2: StartOfPacket signals of TLP-Switch, while testing routing of packets.

In the wave diagram it is seen that the StartOfPacket signals go high in the right order.

This test was performed under maximum load with no spare clock cycles in between

packets. It is hereby verified that the packets are routed correctly through the system.

6.1.4 Verifying functionality of modules

For testing and verifying functionality in the different modules, we use the testing me-

thod described in section 6.1.2. To test the functionality of the PCIe-Bridge we list the

testable functionality in Table 6-1.



Table 6-1: Testable functionality of modules in the PCIe-Bridge.

Module name      Testable functionality
PI-IRQ-module    Assertion/deassertion of each interrupt pin, multiple at a time.
PI-module        Read/write & burst/non-burst & QW/non-QW aligned address & posted reads to all switch chips at the same time
SI-module        Read/write & burst/non-burst & QW/non-QW aligned address & from all the different chips
Control-module   Read/write
Unsup-module     Return unsupported request completion

In the following sections, we go through the testing and verification of the different

modules.

6.1.4.1 PI-IRQ-module

The switch chips drive the interrupt pins, so the PI-IRQ-module only sends packets and never receives any. The interrupt pins are asserted and deasserted in the simulation models of the 3 parallel interfaces in the testbench. To test the functionality of the PI-IRQ-module, the interrupt pins are asserted and deasserted, first one at a time and afterwards several at a time, using interrupt pulses that are 10 ns wide. Every assertion and deassertion should result in the PI-IRQ-module sending a message TLP. These message TLPs can be accounted for in the output file. Figure 6-3 shows a timing diagram of the interrupt pins during the test of the PI-IRQ-module.

Figure 6-3: Timing diagram of interrupt signals, when testing PI-IRQ-module.

First, each of the interrupt pins is asserted and deasserted one at a time, and afterwards all the interrupt pins are asserted at once and then deasserted at once. Table 6-2 shows the message codes of the received message TLPs; the files from this test can be seen in Appendix C.4-C.5.



Table 6-2: Results of PI-IRQ-module test.

Message code   Message name
0x23           Assert_INTD
0x21           Assert_INTB
0x20           Assert_INTA
0x22           Assert_INTC
0x24           Deassert_INTA
0x25           Deassert_INTB
0x26           Deassert_INTC
0x27           Deassert_INTD
0x20           Assert_INTA
0x21           Assert_INTB
0x22           Assert_INTC
0x23           Assert_INTD
0x24           Deassert_INTA
0x25           Deassert_INTB
0x26           Deassert_INTC
0x27           Deassert_INTD

It is seen in the results that all the message TLPs arrive at the output. It is hereby veri-

fied that the PI-IRQ-module is working properly.
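The check on Table 6-2 can be reproduced mechanically. The sketch below is an illustration, not the authors' tooling: it replays the message codes in the order listed in the table, counts how often each message occurs and tracks the resulting state of each interrupt line. The helper name `final_pin_state` is an assumption.

```python
from collections import Counter

# Message codes and names as listed in Table 6-2.
MESSAGE_NAMES = {
    0x20: "Assert_INTA", 0x21: "Assert_INTB",
    0x22: "Assert_INTC", 0x23: "Assert_INTD",
    0x24: "Deassert_INTA", 0x25: "Deassert_INTB",
    0x26: "Deassert_INTC", 0x27: "Deassert_INTD",
}

# Observed message codes, in the order of Table 6-2.
OBSERVED = [
    0x23, 0x21, 0x20, 0x22, 0x24, 0x25, 0x26, 0x27,
    0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27,
]


def final_pin_state(codes):
    """Replay the messages and return the final state of each INTx line."""
    state = {pin: False for pin in "ABCD"}
    for code in codes:
        action, pin = MESSAGE_NAMES[code].split("_INT")
        state[pin] = action == "Assert"
    return state


counts = Counter(MESSAGE_NAMES[code] for code in OBSERVED)
state = final_pin_state(OBSERVED)
```

Each of the eight messages occurs twice, once for the individual test and once for the simultaneous test, and every line ends deasserted.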

6.1.4.2 PI-module

The PI-module should be able to execute requests from the 4 different queues and it

should be possible to have a posted read in progress on each of the 3 chips at the same

time. The non-posted reads should support burst requests. To ensure that the data is

aligned correctly, all of these requests should be tested with and without a QW-aligned

address.

The data returned by the PI is the address shifted 1 to the right. This enables us to check

if the address is incremented correctly. Through an 8-bit parallel interface, the data from

every other read should increment by 1. Through a 16-bit parallel interface, the data

from every read should increment by 1.
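The expected data pattern can be written down directly. The following sketch assumes byte addressing and that each read advances the address by the bus width in bytes; neither assumption is spelled out in this section, and the function name is hypothetical.

```python
# Expected read data from the simulated PI model: the address shifted right by 1.
# Assumes byte addressing and an address step equal to the bus width in bytes.
def expected_pi_data(start_addr: int, n_reads: int, bus_width_bits: int):
    step = bus_width_bits // 8
    return [(start_addr + i * step) >> 1 for i in range(n_reads)]


data_8bit = expected_pi_data(0, 6, 8)     # increments on every other read
data_16bit = expected_pi_data(0, 4, 16)   # increments on every read
```

Under these assumptions `data_8bit` is `[0, 0, 1, 1, 2, 2]` and `data_16bit` is `[0, 1, 2, 3]`, matching the two cases described above.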

All tests should be run with an 8-bit and a 16-bit parallel interface. The binary test files for these tests are shown in Appendix C.6-C.8. All requests returned a correct completion. It is thereby verified that the PI-module can execute the requests that it should support. To verify that the PI-module has the performance characteristics that it was designed to have, another test is needed. We ran a series of tests with different compositions of posted reads and non-posted reads; the test files for these tests can be seen in Appendix C.9-C.15. In Figure 6-4, the results of these tests are plotted together with the graph of the maximum bandwidth for a shared bus with 3 posted requests in progress at a time.

Figure 6-4: Measured bandwidth of the implemented PI-module as a function of posted reads, when all 3 PIs are 16 bits wide.

In the performance test, the posted reads are distributed equally over the chips to get the maximum bandwidth of the PI-module. The measured values are a little lower than the theoretical values. This is because the test only executes around 10 requests; the measured values would be closer to the theoretical values if more requests were executed. We see in Figure 6-4 that the measured values follow the same trend as the theoretical values. It is hereby verified that the PI-module is working properly.

6.1.4.3 SI-module

The SI-module should support the same requests as the PI-module, except for posted reads. The test files used to test the SI-module are seen in Appendix C.16-C.17.

All the requests have a matching and correct completion. It is hereby verified that the

SI-module is working properly.

6.1.4.4 Control-module

To test the Control-module, the input file should contain a write TLP and a read TLP to the same control register. In a successful test the read TLP returns the data just written by the write TLP. A limitation of this module is that it does not support burst reads and writes. To ensure that the module does not need a reset after a burst read or write has been sent, the same test should be repeated after a burst read and a burst write have been sent. The test result is seen in Appendix C.18-C.19. For all the supported requests there was a correct completion with the expected data. It is hereby verified that the Control-module is working properly.

6.1.4.5 Unsup-module

The Unsup-module should catch all TLPs with types other than memory requests. For each caught TLP the Unsup-module should return an “unsupported request” completion. To test the Unsup-module, TLPs of all possible types are sent, and for every unsupported request a correct completion should be returned. See Appendix C.20-C.21 for the input and output files of this test. All the completions have the completion status unsupported request, and the requester ID and completer ID are set correctly. It is hereby verified that the Unsup-module is working properly.

6.2 Synthesis

6.2.1 Hardware usage (JH)

The synthesis report gives an overview of the total amount of hardware used on the FPGA to implement the PCIe-Bridge. To make sure that the hardware usage is as expected, the components must be investigated to see whether the usage found in the synthesis report matches what is expected from the HDL implementation.

For the PCIe-Bridge, the synthesis report gives the total hardware usage shown in Table 6-3.

Table 6-3: Total hardware usage of the PCIe-Bridge design on FPGA

Resources           Usage
LC combinationals   4175
LC registers        2943
Memory bits         55136
Pins                64

An LC register is a 1-bit register. The FPGA has 14400 logic cells (LC) and 552960 memory bits, so the FPGA has no problem holding the designed PCIe-Bridge when it comes to area.

Of the pins, 60 are accounted for in Table 3-3. The PCIe interface accounts for 3 pins, and the serial interface has two output pins for the serial clock, increasing the number of pins used by the serial interface by one. With this, all 64 pins are accounted for.
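The figures above translate into utilisation percentages as follows. This is a small worked calculation using only numbers from Table 6-3 and the surrounding text; comparing both the combinational and the register count against the 14400 logic cells is an assumption about how the device capacity should be read.

```python
# Device capacity and usage, as quoted in the text and in Table 6-3.
LOGIC_CELLS = 14400
MEMORY_BITS = 552960
USAGE = {"LC combinationals": 4175, "LC registers": 2943, "Memory bits": 55136, "Pins": 64}

lc_comb_util = USAGE["LC combinationals"] / LOGIC_CELLS
lc_reg_util = USAGE["LC registers"] / LOGIC_CELLS
memory_util = USAGE["Memory bits"] / MEMORY_BITS

# Pin budget: 60 pins from Table 3-3, 3 for PCIe, 1 extra serial clock output.
pin_total = 60 + 3 + 1

print(f"LC combinationals: {lc_comb_util:.1%}")
print(f"LC registers:      {lc_reg_util:.1%}")
print(f"Memory bits:       {memory_util:.1%}")
```

All three ratios stay well below one third, and the pin budget adds up to the 64 pins reported by the synthesis tool.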

The only components using memory bits are the FIFOs. Not all of the FIFOs are implemented in block RAM, though; some FIFOs are made using registers to increase performance. All FIFOs have a data width of 66-69 bits. The FIFOs vary in depth, and the input FIFO of the SI-module in particular is among the deepest. An overview of the FIFOs used and their expected memory usage is seen in Table 6-4.

Table 6-4: Expected memory bit and register usage of implemented FIFOs

FIFO component name   # of FIFOs   FIFO data width   FIFO data depth   Bit usage pr. FIFO   Memory bit usage   Register bit usage (for queue)
Output_FIFO           6            67                64                4288                 25728              0
PI_fastFIFO           1            69                128               8832                 8832               0
PI_slowFIFO           4            67                32                2144                 8576               0
SI_input_FIFO         1            66                128               8448                 8448               0
SI_output_FIFO        1            66                64                4224                 4224               0
Unsup_FIFO            2            66                4                 264                  0                  528
Switch_FIFO           1            66                4                 264                  0                  264
Total                 16                                                                    55608              792

The expected number of memory bits exceeds the actual number of memory bits given in Table 6-3 by just under 500 bits. This indicates that some unused bits in some FIFOs are removed during synthesis. The total memory usage is nevertheless in the region of what is expected.
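The per-FIFO figures of Table 6-4 follow from width × depth, which the sketch below recomputes. The flag that marks a FIFO as block RAM or register based is inferred from which column of the table holds its bits. Summing the rows as printed gives 55808 memory bits, which is 200 more than the printed total of 55608; the sketch does not attempt to decide which of the printed figures contains the slip.

```python
# Rows of Table 6-4:
# (name, count, width in bits, depth in words, printed bits per FIFO, in block RAM)
FIFOS = [
    ("Output_FIFO",    6, 67,  64, 4288, True),
    ("PI_fastFIFO",    1, 69, 128, 8832, True),
    ("PI_slowFIFO",    4, 67,  32, 2144, True),
    ("SI_input_FIFO",  1, 66, 128, 8448, True),
    ("SI_output_FIFO", 1, 66,  64, 4224, True),
    ("Unsup_FIFO",     2, 66,   4,  264, False),
    ("Switch_FIFO",    1, 66,   4,  264, False),
]

memory_bits = 0
register_bits = 0
fifo_count = 0
for name, count, width, depth, printed_bits, in_block_ram in FIFOS:
    bits = width * depth
    # Every printed per-FIFO figure matches width * depth.
    assert bits == printed_bits, name
    fifo_count += count
    if in_block_ram:
        memory_bits += count * bits
    else:
        register_bits += count * bits

ACTUAL_MEMORY_BITS = 55136   # from Table 6-3
surplus_vs_actual = memory_bits - ACTUAL_MEMORY_BITS
```

With the recomputed sum the expected usage exceeds the synthesized 55136 bits by 672 bits rather than 472, which does not change the conclusion that the memory usage is in the expected region.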

The synthesis report also shows, that each FIFO and the PCI Express module use

around 40-45 LC combinationals and 30 LC registers. This means that the FIFOs and

the PCIe-Module together will use close to 1200 LC registers.

Looking at the register use for the PI FSM control in the PI module, the header of a TLP

will be saved to registers when executing a posted read from each of the three designat-

ed FIFOs. In addition, the header that is being executed will be saved to registers result-

ing in a total of four TLP headers being saved to registers. For this purpose then, there is

a need for 16 32-bit registers resulting in 512 LC registers. In addition, the PI FSM con-

trol saves some data and holds some counters in registers, resulting in somewhere

around 620 LC registers in total for the PI FSM control.

An analysis over all the LC registers for each component in each module is done. The

expected LC register usage is seen in Table 6-5.



Table 6-5: Expected LC register usage for the PCIe-Bridge design

Module                                  Component                         Number of LC registers expected
PI-Module                               PI FSM Control                    620
                                        PI FSM Completion                 90
                                        PI FSM Interface                  25
                                        PI Interrupt                      25
SI-Module                               SI FSM Control                    300
                                        SI FSM input and SI FSM output    10
                                        SI FSM interface                  50
TLP Switch                                                                150
TLP Multiplexer                                                           10
Control registers and GPIOs                                               600
Unsupported TLP                                                           35
Dummy Interface                                                           5
Registers from FIFOs and PCIe module                                      1200
Expected total of LC registers                                            3120

The expected total number of LC registers is around 200 more than the actual number of LC registers. This difference may be explained by the synthesis process, in which the synthesis tool optimizes the design and thereby saves some LC registers.
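The totals in Table 6-5 can be re-added as a quick consistency check. The sketch below uses the figures from Table 6-5 and Table 6-3 only; the header register count assumes four DWs per TLP header, which is what the 16 32-bit registers for four headers imply.

```python
# Expected LC registers per component, from Table 6-5.
EXPECTED_LC_REGISTERS = {
    "PI FSM Control": 620,
    "PI FSM Completion": 90,
    "PI FSM Interface": 25,
    "PI Interrupt": 25,
    "SI FSM Control": 300,
    "SI FSM input and SI FSM output": 10,
    "SI FSM interface": 50,
    "TLP Switch": 150,
    "TLP Multiplexer": 10,
    "Control registers and GPIOs": 600,
    "Unsupported TLP": 35,
    "Dummy Interface": 5,
    "Registers from FIFOs and PCIe module": 1200,
}
ACTUAL_LC_REGISTERS = 2943   # from Table 6-3

expected_total = sum(EXPECTED_LC_REGISTERS.values())
difference = expected_total - ACTUAL_LC_REGISTERS

# Four saved TLP headers of four 32-bit DWs each in the PI FSM control.
header_registers = 4 * 4 * 32
```

The rows add up to the printed total of 3120, the difference to the synthesized design is 177 registers, and the four saved headers account for 512 of the roughly 620 registers in the PI FSM control.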

Altera Quartus II does not give a detailed description of what combinational logic has

been used and where it has been used. Because of this, the number of LC combination-

als will not be investigated further.

The hardware usage of the PCIe-Bridge on the FPGA is as expected and fits on the cho-

sen FPGA.

6.2.2 Meeting the timing constraints (RBS)

When synthesizing the PCIe-Bridge, the timing constraints were not met at first. After altering the code to minimize the levels of logic in the paths that did not meet the timing constraints, timing was still not closed. The failing paths were mainly paths routed to and from FIFOs. At this stage all the FIFOs were implemented in block RAM, so to reduce the time lost in routing, we instructed Quartus II to implement some of the FIFOs in registers instead of block RAM. This makes the placement and routing of the design more flexible, and it enabled Quartus II to close timing.

In the PI-module some multiplexers have been inferred to handle the requests from the four different queues. These multiplexers showed up in many of the critical paths around the bridge. The alterations we made to shorten the critical paths were mainly around these multiplexers. Some of the alterations involved minor changes in the data flow of the design, and others consisted of optimizing the way the code was written. One of the changes was to pass the chip address, used to determine on which chip each request should be executed, from the TLP-Switch to the PI. This change made it unnecessary to interpret the chip address again and thereby saved a multiplexer.

Finally, we managed to meet the timing constraints of the PCIe-Bridge on the slowest speed grade of the Cyclone IV chip series (the EP4CGX15BF14C8), which also helps to reduce the component cost of the PCIe-Bridge.

7 CONCLUSION

7.1 Results

In this project we have designed and implemented a working PCIe-Bridge, as proposed by the original project description from Vitesse. We have achieved this by seeking answers to the five questions we asked in section 1.2, and we have used these questions to structure our work and this report.

The functional requirements for our PCIe-Bridge were defined as a compromise between the full-blown set of functions Vitesse could wish for and what could realistically be achieved by two bachelor students. From the investigation of the requirements, we found that the PCIe-Bridge should implement all three layers of a PCI Express endpoint and support the split transaction capability of the PCI Express interface. Furthermore, the PCIe-Bridge should be able to interpret the packet-based requests from the PCI Express interface and process them through the parallel interface or the serial interface. The implementation should be a compromise between cost and functionality.

From the identified requirements for the PCIe-Bridge, we derived that the FPGA needed should have an IP core for PCI Express communication and enough I/O pins for a shared parallel interface bus. The shared bus of the parallel interface is a compromise to minimize the number of pins used and thereby keep the price of the device low. The Altera Cyclone IV GX (EP4CGX15BF14C8) turned out to be the best-suited FPGA for the job.

A good way of designing the PCIe-Bridge for implementation in the Altera Cyclone IV is to use the PCIe IP core with Avalon-ST interface that can be instantiated in Quartus II. The Avalon-ST interface makes use of the split transaction capability of PCI Express, making it possible to execute multiple requests in parallel. This enables us to have posted reads in progress on all switch chips while executing fast reads and writes, and at the same time to execute requests through the serial interface. By making use of split transactions in the design, we obtained full utilization of the PI and SI bandwidth. We made a considerable effort to arrive at a structural design of the PCIe-Bridge that is both intuitive and modular, and hence straightforward to implement, test and maintain.
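To illustrate why split transactions allow several requests to be in progress at once, the following is a minimal Python sketch of tag-based matching of requests and completions. It is a toy model and not the VHDL implementation of the bridge. It assumes only that each outstanding request carries a tag and that its completion returns the same tag, as in PCI Express. The class `SplitTransactionTracker`, its methods, the limit of 32 tags and the target names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SplitTransactionTracker:
    """Toy model of tag-based request/completion matching (hypothetical names)."""
    max_tags: int = 32  # assumed limit on requests in flight
    outstanding: dict = field(default_factory=dict)  # tag -> target name

    def issue(self, target: str) -> int:
        """Issue a read towards `target` and return the tag assigned to it."""
        for tag in range(self.max_tags):
            if tag not in self.outstanding:
                self.outstanding[tag] = target
                return tag
        raise RuntimeError("no free tag: too many requests in flight")

    def complete(self, tag: int, data: int) -> tuple:
        """Match a completion to its request; completions may arrive in any order."""
        target = self.outstanding.pop(tag)  # raises KeyError for an unknown tag
        return target, data


tracker = SplitTransactionTracker()
# Three requests are in flight at the same time: two switch chips on the
# parallel interface (PI) and one device on the serial interface (SI).
t_chip0 = tracker.issue("PI chip 0")
t_chip1 = tracker.issue("PI chip 1")
t_si = tracker.issue("SI device")

# Completions return in a different order than the requests were issued.
results = [
    tracker.complete(t_si, 0x11),
    tracker.complete(t_chip1, 0x22),
    tracker.complete(t_chip0, 0x33),
]
```

Because every completion is matched by its tag rather than by arrival order, a slow request on one interface does not block requests on the others, which is the property the design relies on to keep the PI and SI busy.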

The designed components of the PCIe-Bridge have been implemented in VHDL, and we took our role as product developers seriously by striving for high-quality code that is well documented and tested. We produced approximately 6800 lines of VHDL source code, which is well suited as the basis for further extensions of the PCIe-Bridge functionality.

Schematics for a PCIe-Bridge add-on board have been made and are ready to be laid

out. The add-on board can be used with the Jaguar and Luton26 reference boards.

We have made a testbench that enables individual and independent testing and verification of each module of the PCIe-Bridge. Each module is tested and verified by sending TLPs that exercise its implemented functionality.

The PCIe-Bridge has not been tested on chip, due to lack of time.

When we look at how far we have gotten in this project, we think that we have been successful in selecting a realistic set of requirements for our version of the PCIe-Bridge. We believe that our design is close to a functional product, once the chosen FPGA becomes available, the PCIe-Bridge add-on board is produced and the design is tested on the board.

7.2 Perspectives

With the developed component, Vitesse customers can now manage the Vitesse switch chip and collect statistical data from it through the widely used PCI Express interface, at little extra cost. This gives the Vitesse switch chips an advantage, because they can now be connected to an external CPU either through a parallel interface or through a PCI Express interface. When the add-on board is produced, it can be sent out to customers together with the Jaguar and Luton26 reference systems. This enables the customers to select the type of external CPU they prefer, making the Vitesse switch chips more attractive in the market.

If Vitesse chooses to integrate a PCI Express endpoint in their future switch chips, they will be able to test the software for external CPU communication with this PCIe-Bridge add-on board. Furthermore, with the PCIe-Bridge, Vitesse will be able to get an idea of the pros and cons of interfacing to the switch chips through PCI Express.

Another perspective of the PCIe-Bridge is to use it in a backplane switch. A backplane switch would have a single CPU controlling several expansion cards. Each of these expansion cards could contain a PCIe-Bridge and a number of switch chips. The setup is shown in Figure 7-1.

[Figure: a backplane switch in which a CPU is connected over PCIe links to three expansion cards, each containing a Bridge FPGA with PI connections.]

Figure 7-1: Block diagram of a backplane switch.

The figure only shows two switch chips on each expansion board, but the PCIe-Bridge

can handle up to three.

Because PCI Express devices are memory mapped, it would be easy to write the software for such a backplane switch. In addition, the bandwidth of PCI Express is easily scalable to fit the size of the backplane switch.
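As a hedged illustration of why memory mapping keeps the software simple, the sketch below treats each expansion card as a window of 32-bit registers inside one mapped address range. It is an assumption-laden example, not software from this project: the window size, the register offset `0x10`, the value written and the class `RegisterWindow` are all hypothetical, and a zero-filled temporary file stands in for the mapped PCI Express address space so that the example runs without hardware.

```python
import mmap
import struct
import tempfile

WINDOW_SIZE = 4096  # assumed size of one card's register window (hypothetical)


class RegisterWindow:
    """32-bit little-endian register access on a memory-mapped region."""

    def __init__(self, mem: mmap.mmap, base: int):
        self.mem = mem
        self.base = base

    def read32(self, offset: int) -> int:
        return struct.unpack_from("<I", self.mem, self.base + offset)[0]

    def write32(self, offset: int, value: int) -> None:
        struct.pack_into("<I", self.mem, self.base + offset, value)


# Stand-in for the mapped PCIe address space: a zero-filled temporary file
# holding three windows, one per expansion card.
backing = tempfile.TemporaryFile()
backing.write(bytes(3 * WINDOW_SIZE))
backing.flush()
mem = mmap.mmap(backing.fileno(), 3 * WINDOW_SIZE)

cards = [RegisterWindow(mem, i * WINDOW_SIZE) for i in range(3)]
cards[1].write32(0x10, 0xCAFE0001)  # write a (hypothetical) register on card 1
value = cards[1].read32(0x10)       # read it back through the same mapping
untouched = cards[0].read32(0x10)   # card 0 has its own window and is unaffected
```

With this model, adding another expansion card only means adding another window at a new base address. The access code itself does not change.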

7.3 Further work

Before this product is released to customers, the designed add-on board should be laid out and the PCIe-Bridge should be tested with the Vitesse switch chip reference systems.

During the implementation work, it became clear that two features would be useful to have in a next release of the PCIe-Bridge:

• Automatic bypass of packets in the TLP-Switch.

• Reset functionality of the PCIe-Bridge through the PCI Express interface.

The automatic bypass should reject packets that are stuck in the TLP-Switch because one of the interface modules has malfunctioned. This clears the way for a reset packet to reach the Control-module.
