The Institute of Electronics, Communications and Information Technology

Reconfigurable Architectures for High Bandwidth Network Processing Systems

Professor John McCanny CBE FRS FREng

Dr Sakir Sezer, Dr Maire McLoone

You see things and say “Why?” but I dream things

that never were and say “Why not?”

George Bernard Shaw

Purpose of Talk

• An overview of Research on reconfigurable architectures for Network Processing applications

• Three aspects
– Node throughput
– QoS
– Data security

Structure of talk

• Convergence of Communication systems

• Processing demands of future networks

• Trade-offs of reconfigurability in network processing in the context of Application specific architectures for

– Programmable Data-Link layer Datagram Processing

– Programmable Packet scheduling Architectures

– Configurable Cryptographic Architectures

• Conclusions

Convergence of Communication and Information Systems

[Figure: Convergence of Technology, Applications and Services. Fixed (xDSL), mobile (3G UMTS), WLAN, WiMAX, satellite, optical and 4G mobile IP access technologies, together with telecommunication, broadcast/multicast (VoD, TV, radio) and computer communication, converge on Internet-based global communication carrying best-effort and real-time interactive services. Bandwidth scale: 10 Kbps, 100 Kbps, 1 Mbps, 10 Mbps, 100 Mbps, from current to future.]

Bandwidth Demand Vs Moore’s Law

Existing data processing architectures are unable to keep up with network processing demands!

• Moore’s Law: silicon integration capability doubles every 18 months

• Internet traffic doubles every 12 months (technology gap)

• Data processing demand at network and access nodes doubles every 6-9 months (network processing gap)
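The size of the two gaps can be estimated with a few lines of arithmetic. The sketch below is illustrative only: it assumes pure exponential growth at the doubling periods quoted on the slide, and the function names `growth_factor` and `gap` are introduced here, not taken from the talk.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Growth over `years` for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (12.0 * years / doubling_months)


def gap(years: float, demand_doubling_months: float,
        capability_doubling_months: float = 18.0) -> float:
    """Ratio of demand growth to silicon capability growth (Moore's Law, 18 months)."""
    return (growth_factor(years, demand_doubling_months)
            / growth_factor(years, capability_doubling_months))


if __name__ == "__main__":
    for years in (1, 3, 5):
        print(f"after {years} years:",
              f"traffic gap x{gap(years, 12.0):.1f},",
              f"node processing gap x{gap(years, 9.0):.1f}",
              f"to x{gap(years, 6.0):.1f}")
```

Under these assumptions the traffic gap doubles every three years, while the node processing gap grows considerably faster.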

Issues

• Internet traffic is continuously doubling every 12 months

• Emerging services require:
– Higher bandwidth (VoD, DVB-IP, VoIP)
– Higher degree of security (Internet banking, Internet shopping, e-business)

• Network processing demands are a consequence of:
– Smaller packet sizes of real-time and interactive services
– QoS requirements of real-time and interactive services
– Complex security processing of sensitive data
– Network protection from viruses and intruders

Network Processor Architectures: High Performance with Flexibility

To cope with exponential growth in bandwidth demands

Complex traffic profiles and heterogeneity of service

To efficiently utilise resources by dynamically adapting network nodes to various traffic patterns

Capability for on-demand and customised QoS support

Cost effective upgrades to new communication protocols

Ideally high levels of compute power, high levels of flexibility

Application Specific, Configurable Network Processing Architectures

• Programmable Data-Link layer Datagram Processing
– Frame Delineation
– Frame Check Sequence

• Programmable Packet scheduling Architectures
– Logic-level reconfigurable packet scheduling architecture
– System-level configurable packet scheduling architecture

• Configurable Cryptographic Architectures
– Iterative and Non-iterative architectures

Data-Link Layer

Protocol processing

Data-Link layer Protocol processing

Data Link Layer protocols enable a point-to-point connection between two peers over a physical link.

[Figure: Data-link layer connection between a client or router (e.g. PC) and a router. Each end has a line card (e.g. Ethernet card or ADSL modem), joined by a physical medium, wired or wireless (e.g. Optical Fiber, Copper (coax, twisted pair, telephone line), free air). The network layer carries the Internet Protocol (IP) packet (H, Payload); the data link layer carries an Ethernet, ESCON or PPP frame (F, H, Payload, FCS, F); the physical layer (PHY) performs modulation and error correction on the raw bit stream.]

F: Frame delimiter
H: Frame header
FCS: Frame check sequence

Common Data Link Layer protocols are: ATM, Ethernet, PPP, GFP, Frame Relay, Fibre Channel etc.

IP over SDH/SONET Data Link Layer Protocols

[Figure: Protocol stack for IP over SDH/SONET. Network layer: Internet Protocol. Data link layer: Ethernet (Bridge, VLAN), PPP, GFP, Frame Relay, ATM (MPOA, MPLS, ...), divided into legacy protocols and emerging protocols. PHY layer: SDH/SONET.]

Data-link layer frame processing

• Frame processing involves two key functions:
– frame delineation and
– frame check sequence (FCS)

• The circuit architectures for both functions are determined by
– the protocols and
– the data-path width

• Numerous Frame Delineation and FCS architectures for PPP, ATM and GFP investigated.
– Scalability
– Throughput
– Hardware costs

• Programmable frame processing architecture is desirable to support a variety of protocols

Data-Link Layer

Protocol processing

Frame Delineation

PPP 32-bit ACCM Transmitter Circuit

[Figure: PPP 32-bit ACCM transmitter circuit. Blocks shown: 32-bit input from the CRC generator, input buffer with input buffer flag, XOR stages, "= 7D or 7E" comparator, 64-bit data reorganiser with control unit, feedback buffer with feedback buffer flag, 8-bit 7E flag input with Read Flag, output buffer, output FIFO and data status signal.]

Includes Asynchronous-Control-Character-Map (ACCM) function.

PPP Frame Delineation Circuit

Post-layout Synthesis - Altera Stratix II

PPP ACCM Transmitter

Data-path width   ALUTs   Registers   ALMs   Speed (MHz)   Data throughput (Mbps)
8-Bit                18          14     10        251.09                  2008.72
16-Bit              123          87     72        273.29                  4372.78
32-Bit              368         191    207        258.59                  8274.88
64-Bit             1463         466    873        173.16                 11082.24
128-Bit           15231        1129   9592         44.69                  5720.32

PPP ACCM Receiver

Data-path width   ALUTs   Registers   ALMs   Speed (MHz)   Data throughput (Mbps)
8-Bit                19           6     16        256.21                  2049.68
16-Bit               84          62     46        268.9                   4302.4
32-Bit              363         165    196        263.92                  8445.5
64-Bit             1740         338    929        161.23                 10318.72
128-Bit           13985         764   8686         36.43                  4663.04

Hardware complexity for an N-bit data path: O(N^2)

8 bit and 32 bit Data Paths

32 bit data path requires additional hardware for rearrangement of data words before and after transmission.

Scaling involves a significant area penalty.

Complex data reorganization circuits designed to overcome the limitations set by an octet based protocol

Requires an increased logic cost by factors of 15 and 26 for the ACCM receiver and the ACCM transmitter circuits respectively.

Majority of logic increase due to the number of byte comparators, as well as the provision of extra routing and the conditional multiplexers
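The octet-based nature of PPP framing, which drives the data reorganisation cost above, can be seen in a software model of the escaping rule. This is a minimal sketch of octet stuffing as described in RFC 1662 (flag 0x7E, escape 0x7D, escaped octets XORed with 0x20, control characters selected by the ACCM); the function names are illustrative, and the receiver is simplified in that it does not discard unescaped control characters.

```python
FLAG = 0x7E    # frame delimiter
ESCAPE = 0x7D  # control escape octet


def needs_escape(octet: int, accm: int) -> bool:
    """True if the octet must be escaped before transmission."""
    if octet in (FLAG, ESCAPE):
        return True
    # Bit n of the 32-bit ACCM selects control character n (0x00..0x1F).
    return octet < 0x20 and (accm >> octet) & 1 == 1


def stuff(payload: bytes, accm: int = 0xFFFFFFFF) -> bytes:
    """Transmitter side: every escaped octet grows to two octets."""
    out = bytearray()
    for octet in payload:
        if needs_escape(octet, accm):
            out.append(ESCAPE)
            out.append(octet ^ 0x20)
        else:
            out.append(octet)
    return bytes(out)


def unstuff(stuffed: bytes) -> bytes:
    """Receiver side (simplified): remove escapes and restore the octets."""
    out = bytearray()
    escaped = False
    for octet in stuffed:
        if escaped:
            out.append(octet ^ 0x20)
            escaped = False
        elif octet == ESCAPE:
            escaped = True
        else:
            out.append(octet)
    return bytes(out)
```

Because each input octet may expand to one or two output octets, a 32-bit input word can produce anything from 4 to 8 octets, which is why a wide data path needs byte comparators on every lane plus buffering and multiplexing to realign the output words.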

• ATM Frame – 5 byte header, 48 byte payload

• Based on Cyclic Coding

• Header Error Check HEC

– Cyclic Redundancy Check (CRC)

– Provides header error detection and frame delineation

– 5th header byte (HEC) calculated from CRC computation of 1st 4 header bytes

– CRC polynomial G(x) = x^8 + x^2 + x + 1
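The HEC-based delineation described above can be modelled in software as follows. This is a hedged sketch, not the circuit from the talk: it searches byte-aligned positions only, whereas the hardware hunts bit by bit, and it assumes the fixed pattern 0x55 that ITU-T I.432 adds to the CRC remainder, which the slide does not mention. The names `crc8`, `atm_hec` and `hunt` are illustrative.

```python
HEC_POLY = 0x07   # x^8 + x^2 + x + 1 with the x^8 term implicit
HEC_COSET = 0x55  # assumption: pattern added to the remainder per ITU-T I.432


def crc8(data: bytes, poly: int = HEC_POLY) -> int:
    """Bitwise CRC-8, MSB first, zero initial value."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) & 0xFF) ^ poly
            else:
                crc = (crc << 1) & 0xFF
    return crc


def atm_hec(header4: bytes) -> int:
    """HEC byte for the first four header bytes."""
    return crc8(header4) ^ HEC_COSET


def hunt(stream: bytes):
    """Return the first byte offset whose 5-byte window is a valid header, else None."""
    for offset in range(len(stream) - 4):
        if atm_hec(stream[offset:offset + 4]) == stream[offset + 4]:
            return offset
    return None


if __name__ == "__main__":
    idle_header = bytes([0x00, 0x00, 0x00, 0x01])
    print(hex(atm_hec(idle_header)))
```

A single match can occur by chance on payload data, so a real delineator confirms the boundary over several consecutive cells before declaring synchronisation.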

ATM Frame Delineation

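[Figure: ATM header CRC computation. The 5-byte header, drawn 8 bits wide, holds the GFC [UNI]/VPI, VPI, VCI, PT and CLP fields in its first 4 bytes, followed by the HEC byte computed over them with G(x) = x^8 + x^2 + x + 1.]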

ATM Bit-by-Bit HEC HUNT 4-Bit Data-Path Architecture

[Figure: ATM bit-by-bit HEC hunt, 4-bit data path. Data In enters a data buffer pipeline; CRC calculators (CRC Calc 1 to CRC Calc 16) with XOR stages serve the Bit 0 and Bit 1 frame boundary checks. After 8 cycles of data, each CRC result is compared with the next 2 nibbles; on a match the output (Data Out, 4 bits) is enabled and the CRC unit is reset.]

ATM HEC hunt: 4, 8, 16, 32 and 64 bit implementations, Altera Stratix Technology

Data path   Logic cells   Registers   Clock frequency (MHz)   Data throughput (Mbps)
4-Bit               402         202                  171.97                   687.88
8-Bit               955         348                  204.42                  1635.36
16-Bit             1159         370                  160.41                  2566.56
32-Bit             1274         386                  127.94                  4098.08
64-Bit             2856         706                  106.39                  6808.96

• 16 bit design: 2.5 Gbps, supports SONET OC48 line rate

• 64 bit design: 6.8 Gbps
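The throughput figures quoted for these designs are consistent with clock frequency multiplied by data-path width. A small check, using the published 16-bit and 64-bit numbers and the standard OC-48 line rate of 2488.32 Mbps (the helper names are illustrative):

```python
OC48_MBPS = 2488.32  # SONET OC-48 line rate


def throughput_mbps(clock_mhz: float, datapath_bits: int) -> float:
    """One data-path word is processed per clock cycle."""
    return clock_mhz * datapath_bits


def meets_line_rate(clock_mhz: float, datapath_bits: int,
                    line_rate_mbps: float = OC48_MBPS) -> bool:
    return throughput_mbps(clock_mhz, datapath_bits) >= line_rate_mbps


if __name__ == "__main__":
    for bits, clock in ((16, 160.41), (64, 106.39)):
        print(bits, round(throughput_mbps(clock, bits), 2), meets_line_rate(clock, bits))
```

The same relation explains why widening the data path helps only while the clock frequency falls more slowly than the width grows.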

Generic Frame Procedure (GFP)

• The Generic Frame Procedure is a Layer-2 framing protocol for data over high-capacity optical networks.

• Recently standardised (ITU-T G.7041) to replace ATM and PPP in high capacity Wide Area Networks (WANs)

• GFP is scalable, allowing the implementation of wide data-path architectures.

• GFP uses a CRC-based frame delineation scheme similar to the ATM HEC hunt and synchronisation technique

GFP Frame Structure

[Figure: GFP frame structure, drawn in transmit bit order and transmit byte order.
Core Header: PLI, cHEC.
Payload Area: Payload Header, Payload, optional pFCS (pFCS [31:24], [23:16], [15:8], [7:0]).
Payload Header: Payload Type MSB (PTI, PFI, EXI), Payload Type LSB (UPI), tHEC MSB, tHEC LSB, optional Extension Header of 0 – 60 bytes.
Example, Linear Extension Header: CID, Spare, eHEC MSB, eHEC LSB.]

16-bit GFP Core Header Error Check (CHEC) field is used for frame delineation
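Besides delineation, a 16-bit check over a 32-bit core header is enough to correct any single-bit error by syndrome look-up, with one fixed syndrome per bit position. The sketch below illustrates the idea under two stated assumptions: the generator x^16 + x^12 + x^5 + 1 is assumed to be the cHEC polynomial (it is not given on the slide), and any scrambling the standard applies to the core header is omitted. All names are illustrative.

```python
CHEC_POLY = 0x1021  # assumption: x^16 + x^12 + x^5 + 1, x^16 term implicit


def crc16(data: bytes, poly: int = CHEC_POLY) -> int:
    """Bitwise CRC-16, MSB first, zero initial value, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) & 0xFFFF) ^ poly
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def build_core_header(pli: int) -> bytes:
    """2-byte PLI followed by its 2-byte check field."""
    pli_bytes = pli.to_bytes(2, "big")
    return pli_bytes + crc16(pli_bytes).to_bytes(2, "big")


def syndrome(header: bytes) -> int:
    """Zero for an error-free header; otherwise identifies the error pattern."""
    return crc16(header)


# One fixed 16-bit constant per header bit: the syndrome of that single-bit error.
SYNDROMES = {crc16((1 << bit).to_bytes(4, "big")): bit for bit in range(32)}


def correct(header: bytes):
    """Return the corrected header, or None if the error is not a single bit."""
    s = syndrome(header)
    if s == 0:
        return header
    bit = SYNDROMES.get(s)
    if bit is None:
        return None
    value = int.from_bytes(header, "big") ^ (1 << bit)
    return value.to_bytes(4, "big")
```

In hardware the dictionary becomes a bank of parallel comparators against fixed constants, one per bit position of the header.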

GFP Frame Delineation 64-bit Datapath with 1-bit Header Error Correction Circuit

[Figure: GFP frame delineation circuit, 64-bit data path. Blocks shown: Data In (64 bits), Data Buffers 1 to 3, 8-byte window gate over byte lanes Bits 0-7 to Bits 56-63, eight CRC calculators (CRC Calc 1 to CRC Calc 8) each feeding a 16-bit comparator, frame synchronisation state machine, payload counter loaded from the PLI field (Bits 16-31), single-bit error correction look-up circuit, enable and control logic, 8-bit latch, and an 88-bit in, 64-bit out MUX driving Data Out (64 bits).]

Preliminary Design Study: Altera Stratix II-3 FPGA Technology

• Max CLK: 165 MHz
• ALUTs: 1107
• Registers: 653
• ALMs: 751
• LABs: 149
• Throughput: 10.5 Gbps

GFP Frame Delineation 64-bit Datapath with 1-bit Header Error Correction Circuit: UMC-130nm Reference Design

Cadence Encounter, UMC 130 nm

• Clock frequency: 250 MHz
• Total area: 0.12 mm^2
• Throughput: 16.0 Gbps
• Total power: 1.6 x 10^-2 W
• Internal power: 1.4 x 10^-2 W
• Switching power: 2.3 x 10^-3 W
• Leakage power: 8.1 x 10^-5 W

Fastest implementation in the literature

Programmable Frame Delineation

[Figure: Programmable frame delineation by multiplexing. Separate ATM, GFP, PPP and Ethernet frame delineation circuits share a 32-bit Data In and a 32-bit Data Out, with Protocol Select, Protocol Select Enable, CLK and Data Status signals.]

Programmable Data-Path PPP/GFP/Ethernet/ATM Frame Delimiter

Data path   ALUTs   ALMs   LABs   Registers   Clock freq (MHz)   Data throughput (Mbps)
32-Bit       1515   1099    209         768              153.9                     4926
64-Bit       4183   2812    490        1523               99.9                     6394


The 10 Gbps target is not achievable in FPGA technology, but should be achievable with an ASIC

32-Bit Protocol Processing Circuit Decomposition

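Resources per 32-bit protocol circuit:

PPP (RxD)
– Buffer registers: 2 4-byte registers
– Muxes: 2 4-byte muxes
– CRC (XOR gates): 0
– Comparators: 8 8-bit comparators with constants
– Protocol state machine: 0
– Error correction look-up circuit: 0
– Counter: 0

PPP (TxD)
– Buffer registers: 2 8-byte registers
– Muxes: 2 8-byte muxes
– CRC (XOR gates): 0
– Comparators: 8 8-bit comparators with constants
– Protocol state machine: 0
– Error correction look-up circuit: 0
– Counter: 0

GFP
– Buffer registers: 2 4-byte registers
– Muxes: 1 7-byte in, 4-byte out
– CRC (XOR gates): 4 16-bit in/16-bit out CRC matrices; 352 XOR gates (88 in each 16x16 matrix)
– Comparators: 4 16-bit comparators
– Protocol state machine: tri-state synchronisation state machine
– Error correction look-up circuit: 32 16-bit comparators with fixed constants
– Counter: PLI counter

ATM (byte-by-byte)
– Buffer registers: 2 4-byte registers
– Muxes: 1 7-byte in, 4-byte out
– CRC (XOR gates): 4 32-bit in/8-bit out CRC matrices; 440 XOR gates (26 in each 8x8 matrix (*4*4) + 3*8 XOR)
– Comparators: 4 8-bit comparators
– Protocol state machine: tri-state synchronisation state machine
– Error correction look-up circuit: 0
– Counter: 48-byte counter

Ethernet
– Buffer registers: 2 4-byte registers
– Muxes: 1 7-byte in, 4-byte out
– CRC (XOR gates): 0
– Comparators: 8 8-bit comparators with constants
– Protocol state machine: 0
– Error correction look-up circuit: 0
– Counter: 3-bit counter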

Programmable ATM/GFP Protocol Frame Delineation Architecture

[Figure: Two programmable ATM/GFP frame delineation architectures, each with 32-bit Data In, 32-bit Data Out, Protocol Select, Protocol Select Enable, CLK and Data Status.
Shared common elements: Data Buffer 1, a 4-byte window and a 7-byte in, 4-byte out data-path mux; four CRC calculators (CRC Calc 1 to 4) used for either GFP HEC or ATM HEC calculation; four comparators; a control unit (dual state machine) providing CRC engine enable, 8-bit/16-bit comparator configuration and counter configuration (PLI field or 48 bytes); payload counter; GFP single-bit error correction look-up circuit; enable and control logic.
Separate data path: an ATM frame delineation circuit and a GFP frame delineation circuit selected by Protocol Select.]

Programmable ATM/GFP Frame Delimiter: shared common elements versus separate data path

Architecture         ALUTs   Registers   ALMs   LABs   Clock frequency (MHz)   Data throughput (Mbps)
Common Elements        885         387    530     78                  165.65                   5300.8
Separate Data-Path     872         531    621    124                  159.46                  5102.72

Annotations on the slide mark the larger and the smaller design.

Architectural Studies – Frame Delineation Architectures

Header Error Check architectures easier to scale than pattern based architectures.

Examined the feasibility of deriving a common programmable frame delineation architecture for layer-2 protocols (PPP, ATM, GFP) operating at 2.5 Gbps or more.

A common architecture could not be derived because of the diverse nature of the techniques used.

Even the GFP and ATM frame delineation techniques, which are based on a similar header error check method, show significant diversity and impose restrictions on a common architecture.

However, a programmable ATM/GFP architecture that is slightly faster and smaller can be derived, and it is highly suitable for standard-cell implementation. The key aspect is a 50% reduction in registers, which are a key cost.

Frame Delineation Architectures - Conclusions

Options

Multiplexed specific-purpose circuits

FPGA that can be reconfigured to implement a specific protocol

The first option is the more efficient implementation (area and speed) for 10 or fewer protocols

Derivation of a programmable datapath based on common low level functional elements is a potential low hardware cost option

Data-Link Layer

Protocol processing

Frame Check Sequence

Frame Check Sequence Circuits

• Data integrity is paramount for data-link layer protocols

• Cyclic Redundancy Check (CRC) is the preferred method for detecting bit and burst errors in protocol payloads caused by medium-related noise.

• Commonly used CRC types for layer-2 protocols

CRC type               Application field
CRC-4, CRC-5, CRC-6    Frame alignment for terminal equipment
CRC-8                  ATM cell header error control
CRC-10                 ATM Adaptation Layer type 3 and 4
CRC-16                 HDLC, Wide Area Network (WAN)
CRC-32                 AAL-5, HDLC, PPP, GFP, Ethernet

Investigated Architectures

• Hardwired parallel CRC circuits for a given port size and generator polynomial G(x).

• Semi-reconfigurable parallel CRC circuit with reconfigurable input port size and a given generator polynomial G(x).

• Fully reconfigurable CRC computation circuit for any generator polynomial G(x) of degree up to 32 and port sizes of 4, 8, 16, 24, and 32 bits.
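All three options rest on the same observation: one clock cycle of a parallel CRC is a linear map over GF(2) from the current CRC bits and the input word to the next CRC bits, so it can be written as a matrix of XOR terms. The sketch below derives that matrix for any polynomial and port size by pushing unit vectors through a bit-serial model. It is an illustration of the principle only, not the patented circuit, and the function names are hypothetical.

```python
def serial_step(crc: int, data: int, width: int, poly: int, degree: int) -> int:
    """Shift `width` data bits (MSB first) through a bit-serial CRC register."""
    top = 1 << (degree - 1)
    mask = (1 << degree) - 1
    for i in reversed(range(width)):
        bit = (data >> i) & 1
        feedback = (1 if crc & top else 0) ^ bit
        crc = (crc << 1) & mask
        if feedback:
            crc ^= poly
    return crc


def build_matrices(width: int, poly: int, degree: int):
    """Matrix columns: the response to each single CRC bit and each single data bit."""
    crc_cols = [serial_step(1 << j, 0, width, poly, degree) for j in range(degree)]
    data_cols = [serial_step(0, 1 << j, width, poly, degree) for j in range(width)]
    return crc_cols, data_cols


def parallel_step(crc: int, data: int, crc_cols, data_cols) -> int:
    """One clock of the parallel circuit: XOR the columns selected by the set bits."""
    nxt = 0
    for j, col in enumerate(crc_cols):
        if (crc >> j) & 1:
            nxt ^= col
    for j, col in enumerate(data_cols):
        if (data >> j) & 1:
            nxt ^= col
    return nxt


def xor_terms(crc_cols, data_cols) -> int:
    """Number of ones in the matrix, a rough proxy for XOR logic cost."""
    return sum(bin(col).count("1") for col in list(crc_cols) + list(data_cols))


if __name__ == "__main__":
    for width in (8, 16, 32):
        cols = build_matrices(width, 0x04C11DB7, 32)
        print(width, xor_terms(*cols))
```

A hardwired circuit fixes the matrix at synthesis time, the semi-reconfigurable circuit selects between matrices for different port sizes, and a fully reconfigurable circuit must store the matrix as configuration data, which is where its extra area comes from.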

Parallel Hardwired CRC-8 Circuit

[Figure: Parallel hardwired CRC-8 circuit. The current CRC register bits X7 to X0 and the data inputs D7 to D0 are combined through a fixed network of XOR gates to produce the next register values X'7 to X'0.]

Parallel CRC-32 with Programmable Input Bus

A programmable input bus is required if the frame size is not a multiple of the port size, or if the frame data is not aligned to the Least-Significant Byte (LSB) of the input bus.

[Figure: Parallel CRC-32 with programmable input bus. A 32-bit Input Data port feeds an input data buffer; a 2-bit Data Bytes Select controls a byte reorder mechanism over byte lanes Bits 24-31, Bits 16-23, Bits 8-15 and Bits 0-7 ahead of the CRC computational matrix, whose 32-bit CRC Output passes through an output data buffer. An accompanying transmit example shows frames made up of F, A, C, P1, P2, numbered data bytes, CRC bytes and F, with the start of frame falling on different byte lanes of the bus.]

• Input bus configuration requires feedback circuit reconfiguration

Dedicated Parallel CRC architectures

CRC polynomials:
– ATM HEC: x^8 + x^2 + x + 1
– HDLC: x^16 + x^15 + x^2 + 1
– Ethernet: x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1

CRC        Port size   ALUTs   ALMs   LABs   Registers   Clock freq (MHz)   Data throughput (Mbps)
ATM HEC    8-Bit          16      8      1           8              240.3                   1922.4
ATM HEC    16-Bit         54     48     10           8              240.9                   3854.4
ATM HEC    32-Bit        126     87     14           8              238.6                   7635.2
ATM HEC    64-Bit        288    179     28           8              170.7                  10924.8
HDLC       8-Bit          37     35      6          16              237.6                   1900.8
HDLC       16-Bit         54     36      6          16              238.4                   3814.4
HDLC       32-Bit        110     81     16          16              237.5                     7600
HDLC       64-Bit        319    205     34          16             194.25                    12432
Ethernet   8-Bit         117     87     15          32              237.5                     1900
Ethernet   16-Bit        167    127     23          32              237.5                     3800
Ethernet   32-Bit        257    196     35          32                233                     7456
Ethernet   64-Bit        662    474     85          32              172.6                  11046.4

These values will be compared with the fully reconfigurable architecture


Programmable CRC with Partial Programmability (uses multiplexing)

[Figure: Programmable CRC built by multiplexing. A 64-bit Data In port feeds four calculators selected by Port Size Configure and Data-Path Configuration, each producing a 32-bit result for the CRC Output:
– 64-Bit CRC Calculator: 2 CRC 32x32 matrices
– 32-Bit CRC Calculator: 1 CRC 32x32 matrix
– 16-Bit CRC Calculator: 1 CRC 32x16 matrix
– 8-Bit CRC Calculator: 1 CRC 32x8 matrix]

Programmable Data-Path CRC-32

Port size   ALUTs   ALMs   LABs   Registers   Clock freq (MHz)   Data throughput (Mbps)
32-Bit        350    246     40          32                156                     4992
64-Bit        872    520     83          32              92.08                     5893


Parallel and Fully Reconfigurable CRC Computation Circuit for High Speed Data Processing

[Figure: 32x32 CRC cell array (Row 0 to Row 31, Col 0 to Col 31). Data inputs D0 to D31 enter the columns and are gated to '0' according to Port Size Configure; outputs CRC0 to CRC31 are gated according to CRC Size Configure. The current CRC value is fed back, with the previous CRC data routed according to the Port Size and CRC Size parameters. A Generate Matrix block with a "P Interface" takes the CRC size, port size and polynomial and computes the matrix column by column; the configuration data is broadcast to every column, and a counter with reset selects the target column through a one-hot column configuration enable.]

Patent Pending

FPGA synthesis results

• Max CLK: 114.98 MHz
• ALUTs: 2240
• Registers: 1365
• ALMs: 1620
• LABs: 292
• Supports throughput rates above 2.5 Gbps (~3.68 Gbps)

Parallel and Fully Reconfigurable CRC Computation Circuit for High Speed Data Processing: UMC-130nm Reference Design

Cadence Encounter, UMC 130 nm

• Clock frequency: 125 MHz
• Total area: 0.27 mm^2
• Throughput: 8.0 Gbps
• Total power: 5.9 x 10^-3 W
• Internal power: 4.2 x 10^-3 W
• Switching power: 1.6 x 10^-3 W
• Leakage power: 1.2 x 10^-4 W

Performance Evaluation

• There is a trade-off cost for programmability

– Fully reconfigurable CRC is 8x larger and 2x slower than the hard-wired CRC-32

– For CRCs with small polynomials and input bus sizes, the area cost difference can be a factor of 100

• Hardware efficient programmability for parallel FEC circuits can be achieved by multiplexing between different custom implementations


• If polynomial G(x) is not known, then a fully programmable implementation is an appropriate solution

• Other applications include storage where CRC computation can be 30% of overall

Programmable Packet Scheduling

Programmable IP packet scheduling

Programmability of Internet Protocol packet scheduling is an essential feature for dealing with

• Complex traffic profiles and heterogeneity of service

• Efficient bandwidth resource utilisation

• To provide on-demand and customised QoS support

Current Internet QoS Problems

• Internet routers support best-effort packet delivery only (Best Effort Service)

• Delay guarantee for delay sensitive services cannot be provided

• Real time and interactive services (Voice, Video) will not meet users’ expectation of quality

[Figure: A best-effort router carrying a mix of Telnet, Web (html), FTP, VoIP and VoD traffic.]

How can we provide QoS in the Internet?

• Single lane model: the current Internet, Best Effort Service

• Multiple lane model (“The Motorway”): proposed and partially deployed method, Differentiated Services (“DiffServ”)

• Resource reservation model (railway / aircraft): proposed method, Integrated Services (“IntServ”)

Packet scheduling is paramount for QoS

• A packet scheduler decides when to send each packet based on
– Traffic type ID (Tag, DiffServ)
– Flow ID (Source Destination Address, IntServ)

• The scheduling algorithm and the deployed service policy determine the QoS performance of the network

• Scheduling method trade-offs: computation complexity versus desired fairness

Router / Switch Output Control

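[Figure: Router/switch output control. A packet classifier sorts packets into queues for Flow 1, Flow 2, ..., Flow N, which are served by the scheduler.]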

Switch adaptation via partial reconfiguration

[Figure: Switch with reconfigurable FIFOs (number of FIFOs and FIFO size). Blocks shown: Cell Input and input processor, Cell Output and output processor, FIFO Control 0 to 3, buffer access control, memory control, data latch, and an external SRAM with 32-bit data bus, 19-bit address bus, CS and WR. Control signals: WR Queue ID, RD Queue ID, Queue WR, Queue RD, Queue Full, Queue Idle.]

Adaptation achieved by partially reconfiguring FPGA by adding or removing (i.e. reconfiguring) packet FIFO circuits and output packet schedulers

Partial Reconfiguration

Issues relating to run-time, gate level configurable logic

• Limited memory resources, off chip memory access a bottleneck

• Reconfiguration interrupts traffic flow - QoS degradation.

• Partial reconfiguration is limited to similar scheduling policies.

• Runtime and partial configuration adds additional complexity

• Partial reconfiguration control remains an unsolved challenge despite a promising model.

• Current FPGA technology immature in terms of design tools and run-time reconfiguration support

Systems Level Approach for Programmable Packet Scheduling

[Figure: System-level programmable packet scheduler. Blocks shown between scheduler input and scheduler output: packet classification (traffic flow/class, QoS), finishing tag computation, tag lookup table write control, finishing tag lookup table with entries (FT, AP, R), tag lookup table read control, shared buffer write control, external shared packet buffer and packet server. Packet location pointers link the lookup table entries to the shared buffer.]

Scheduling policy determines the computation and the use of parameters

Patent Pending

Packet handling isolated from schedule policy functions

Systems Level Approach for Programmable Packet Scheduling

• Packet handling functions isolated from scheduler policy specific functions.

• Individual queues replaced by a more complex shared buffer architecture that can support multiple queues
– controlled via address pointers and link lists.

• Provides clear separation of the packet scheduling architecture into
– a circuit purely responsible for dealing with packet service policies, i.e. scheduling algorithms, and
– a circuit concerned with packet handling, e.g. store/retrieve

• Allows flexible programmability of the scheduling policy and number of packet queues without reconfiguring the hardware

• Comparable throughput rates to implementations with physically built queues
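The separation described above can be modelled in a few lines of software. The sketch below is a simplified illustration under stated assumptions, not the patented architecture: a dictionary stands in for the external shared packet buffer, a heap stands in for the finishing tag lookup table, and the two policy functions (`fifo_policy` and a weighted policy loosely modelled on fair queueing, using arrival time in place of virtual time) are hypothetical examples of programmable service policies.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple

# A policy maps (flow_id, packet_length, now, last_tag_of_flow) to a finishing tag.
Policy = Callable[[int, int, float, float], float]


def fifo_policy(flow_id: int, length: int, now: float, last_tag: float) -> float:
    """Serve packets in arrival order."""
    return now


def make_weighted_policy(weights: Dict[int, float]) -> Policy:
    """Higher weight gives a flow earlier finishing tags for the same packet size."""
    def policy(flow_id: int, length: int, now: float, last_tag: float) -> float:
        return max(now, last_tag) + length / weights.get(flow_id, 1.0)
    return policy


class SharedBufferScheduler:
    def __init__(self, policy: Policy) -> None:
        self.policy = policy
        self.buffer: Dict[int, bytes] = {}        # pointer -> packet (shared buffer)
        self.tags: List[Tuple[float, int]] = []   # (finishing tag, pointer)
        self.last_tag: Dict[int, float] = {}      # per-flow state
        self.next_pointer = 0

    def set_policy(self, policy: Policy) -> None:
        """Change the scheduling policy without touching packet handling."""
        self.policy = policy

    def enqueue(self, flow_id: int, packet: bytes, now: float) -> int:
        pointer = self.next_pointer
        self.next_pointer += 1
        self.buffer[pointer] = packet             # packet handling
        tag = self.policy(flow_id, len(packet), now, self.last_tag.get(flow_id, 0.0))
        self.last_tag[flow_id] = tag              # policy-specific state
        heapq.heappush(self.tags, (tag, pointer))
        return pointer

    def dequeue(self) -> Optional[bytes]:
        if not self.tags:
            return None
        _, pointer = heapq.heappop(self.tags)     # smallest finishing tag first
        return self.buffer.pop(pointer)
```

Queues exist here only as tags and pointers, so adding a flow or swapping the policy changes data rather than structure, which mirrors the claim that no hardware reconfiguration is needed.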

Configurable Packet Scheduling

Cadence Encounter, UMC 130 nm

• Clock frequency: 143 MHz
• Number of IOs: 478 pins
• Total area: 14.4 mm^2
• Number of sessions: 1,000,000
• Number of packets: up to 30 million packets can be supported in external DDR
• Throughput: 35.8M packets/sec, ~40 Gbps line rate (assuming a mean IP packet size of 130 bytes)
• Layout: address translation table, search trie memory, 90% distributed embedded memory

Patent Pending

Conclusions

Queue and scheduling policy adaptation via address pointer, lookup tables and packet time-stamp processing

Low programming complexity

Service requirements can be translated immediately

No traffic interruption is required

instant change of queues and scheduling policies

Conclusions

Performance comparable with customized implementations

Complex, but affordable data processing hardware

Programmability at the system level NOT reconfigurability at gate level

Programming scheduler does not require place and route at gate level

Programmable

Cryptographic Architectures

Programmable Cryptographic Architectures

• Encryption needs to be performed on data in real-time
– 100 Mbps networks, 1G Ethernet, 10G Ethernet

• This holds the key to successful growth of applications such as WLANs, satellite communications, e-businesses …

• Software architectures are too slow

• Hardware solutions required

Programmable Cryptographic Architectures

• Reconfigurable Cryptographic Architectures can be used to provide the security requirements of many applications

• FPGAs are well suited for crypto algorithms:
– allow algorithm agility
– support alterable architecture parameters, scalable security (DES/3DES)

• Clever mapping of complex math operations onto special purpose silicon architectures

Private Key Algorithms: AES

• NIST requested a new Advanced Encryption Standard (AES) to replace DES - Sept 1997

• Interim measure – TripleDES

• RIJNDAEL : AES Winner - Oct 2000

• Developed by Joan Daemen, Vincent Rijmen

• Replaced DES as Federal Standard in November 2001

• 128-bit Data, 128, 192 or 256-bit Key

Reconfigurable AES Architecture

• In conjunction with AES, NIST recommended 5 modes of operation

– Electronic Codebook (ECB) mode

– Cipher Block Chaining (CBC) mode

– Output Feedback (OFB) mode

– Ciphertext Feedback (CFB) mode

– Counter (CTR) mode: a simplification of OFB mode
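The difference between a feedback mode and counter mode can be shown with a short model. This is a sketch of the mode structure only: `toy_block` is a hash-based stand-in introduced here so the example runs with the standard library, it is NOT AES and not secure, and the 8-byte nonce plus 8-byte counter layout is an assumption.

```python
import hashlib

BLOCK = 16  # AES block size in bytes (128-bit data)


def toy_block(key: bytes, block: bytes) -> bytes:
    """Stand-in for the forward cipher transform. NOT AES, for illustration only."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def ofb(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Output feedback: each keystream block depends on the previous one."""
    out = bytearray()
    feedback = iv
    for i in range(0, len(data), BLOCK):
        feedback = toy_block(key, feedback)
        out += xor_bytes(data[i:i + BLOCK], feedback)
    return bytes(out)


def ctr(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Counter mode: keystream block i depends only on the nonce and i."""
    out = bytearray()
    for index, i in enumerate(range(0, len(data), BLOCK)):
        counter_block = nonce + index.to_bytes(8, "big")
        out += xor_bytes(data[i:i + BLOCK], toy_block(key, counter_block))
    return bytes(out)
```

In OFB the next cipher call cannot start until the previous one has finished, which favours an iterative architecture, whereas CTR keystream blocks are independent and can be computed in parallel or in a pipeline. Both modes use only the forward transform and decrypt by applying the same function again.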

Private Key Algorithms: AES

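[Figure: AES encryption data flow. PlainText and Key enter a Data/Key Addition stage, followed by rounds Rnd 0 ... Rnd 8 and a Final Rnd that produces the CipherText; a Key Schedule supplies the round keys. The round shown consists of ByteSub, ShiftRow, MixCol and Key Addition.]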

Reconfigurable AES Architecture

• Reconfigurable AES architecture with following features

– Iterative architecture

– On-chip key scheduling

– Support for 3 key lengths

– Encryption & Decryption

– Support for feedback modes of operation


Reconfigurable AES vs Specific-purpose Enc/Dec

Design                                      Device    Area                      Throughput (Mbps)
AES Encryptor, 128-bit key                  XCV400E   1987 Slices, 18 BRAMs     423
AES Decryptor, 128-bit key                                                      557
AES Enc/Dec, 128-bit key, supports 5 modes                                      310

• 2 additional BRAMs required in reconfigurable design as memory re-use possible

• Reconfigurable Design

– Throughput reduced up to 40%

– Area increased by 10%

• However, modes of operation supported

=> Area/speed penalty is an acceptable trade-off in favour of using reconfiguration over multiple specific-purpose circuits

Conclusions

• Frame Delineation

– Common architecture could not be found

– Separate FPGA circuit for each or multiplex between separate circuits on an ASIC

– Derivation of a programmable datapath based on common low level functional elements is a potential low hardware cost option

• CRC circuits – well defined (i.e. 8) options

– Fully reconfigurable ASIC possible but larger and slower than 8 separate versions

– If G(x) and number of options is not known then use fully programmable solution

Conclusions

• Packet Scheduler

– Systems level approach deploying address pointer, lookup tables and packet time-stamp processing the most appropriate approach

– Enables programmability while supporting line rates beyond 100 Gbps

– Best approach, tackle at the Systems and Architecture level rather than FPGA level

– Current FPGA technology and design tools for run-time reconfiguration too immature for packet scheduling

Conclusions

• Encryption/Decryption

– Re-configurable architecture identified

– Supports a number of modes of operation

• Reconfigurable Design

– Throughput reduced up to 40%

– Area increased by 10%

– Area/speed penalty acceptable trade-off compared with reconfiguration of multiple specific circuits

The Institute of Electronics, Communications and Information Technology

Professor John McCanny CBE FRS FREng

Dr Sakir Sezer ([email protected]), Dr Maire McLoone (m.mcloone@ecit.qub.ac.uk)
