The Institute of Electronics, Communications and Information Technology
Reconfigurable Architectures for High Bandwidth Network Processing Systems
Professor John McCanny CBE FRS FREng
Dr Sakir Sezer, Dr Maire McLoone
You see things and say “Why?” but I dream things
that never were and say “Why not?”
George Bernard Shaw
Purpose of Talk
• An overview of research on reconfigurable architectures for network processing applications
• Three aspects
– Node throughput
– QoS
– Data security
Structure of talk
• Convergence of Communication systems
• Processing demands of future networks
• Trade-offs of reconfigurability in network processing in the context of Application specific architectures for
– Programmable Data-Link layer Datagram Processing
– Programmable Packet scheduling Architectures
– Configurable Cryptographic Architectures
• Conclusions
Convergence of Communication and Information Systems
[Figure: convergence of technology, applications and services. Access technologies (fixed xDSL, mobile 3G/UMTS, WLAN, WiMAX, satellite, optical, 4G mobile IP) converge on Internet-based global communication. Telecommunication, broadcast/multicast (VoD, TV, radio) and computer communication converge on best-effort services and real-time interactive services. Bandwidth scale from current to future: 10 Kbps, 100 Kbps, 1 Mbps, 10 Mbps, 100 Mbps.]
Bandwidth Demand Vs Moore’s Law
Existing data processing architectures are unable to keep up with network processing demands !!
• Moore’s Law: silicon integration capability doubles every 18 months
• Internet traffic doubles every 12 months
• Data processing demand at network and access nodes doubles every 6-9 months
[Figure: growth curves for the three trends; the widening differences between them are labelled "Technology Gap" and "Network Processing Gap".]
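The three doubling periods can be compared with a short calculation. This is only an illustrative sketch: the 3-year horizon and the names `growth_factor`, `YEARS`, `silicon`, `traffic`, `demand_low` and `demand_high` are assumptions chosen for the example, not figures from the talk.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Growth multiple after `years` for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (12.0 * years / doubling_months)


YEARS = 3  # illustrative horizon (an assumption, not a figure from the slides)

silicon = growth_factor(YEARS, 18)      # Moore's Law
traffic = growth_factor(YEARS, 12)      # Internet traffic
demand_low = growth_factor(YEARS, 9)    # node processing demand, slow end
demand_high = growth_factor(YEARS, 6)   # node processing demand, fast end

print(f"silicon x{silicon:.0f}, traffic x{traffic:.0f}, "
      f"node demand x{demand_low:.0f} to x{demand_high:.0f}")
print(f"demand outgrows silicon by x{demand_low / silicon:.0f} "
      f"to x{demand_high / silicon:.0f}")
```

Under these assumptions, silicon capability grows 4-fold in three years while node processing demand grows 16-fold to 64-fold, which is the gap the slide refers to.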
Issues
• Internet traffic is continuously doubling every 12 months
• Emerging services require:
– Higher bandwidth (VoD, DVB-IP, VoIP)
– Higher degree of security (Internet banking, Internet shopping, e-business)
• Network processing demands are a consequence of:
– Smaller packet sizes of real-time and interactive services
– QoS requirements of real-time and interactive services
– Complex security processing of sensitive data
– Network protection from viruses and intruders
Network Processors Architectures -High performance with flexibility
To cope with exponential growth in bandwidth demands
Complex traffic profiles and heterogeneity of service
To efficiently utilise resources by dynamically adapting network nodes to various traffic patterns
Capability for on-demand and customised QoS support
Cost effective upgrades to new communication protocols
Ideally high levels of compute power, high levels of flexibility
Application Specific, Configurable Network Processing Architectures
• Programmable Data-Link layer Datagram Processing
– Frame Delineation
– Frame Check Sequence
• Programmable Packet scheduling Architectures
– Logic-level reconfigurable packet scheduling architecture
– System-level configurable packet scheduling architecture
• Configurable Cryptographic Architectures
– Iterative and Non-iterative architectures
Data-Link Layer
Protocol processing
Data-Link layer Protocol processing
Data Link Layer protocols enable a point-to-point connection between two peers over a physical link.
[Figure: a client or router (e.g. a PC) and a router, each attached through a line card (e.g. an Ethernet card or ADSL modem) to a physical medium (wired or wireless), e.g. optical fibre, copper (coax, twisted pair, telephone line) or free air. Network layer: Internet Protocol (IP) packet, H | Payload. Data link layer: Ethernet, ESCON or PPP frame, F | H | Payload | FCS | F. Physical layer (PHY): modulation and error correction of the raw bit stream ∙∙∙0100110101001010011∙∙∙]
F: Frame delimiter
H: Frame header
FCS: Frame check sequence
Common Data Link Layer protocols are: ATM, Ethernet, PPP, GFP, Frame Relay, Fibre Channel etc.
IP over SDH/SONET Data Link Layer Protocols
[Figure: protocol stack. Network layer: Internet Protocol. Data link layer: Ethernet (bridge, VLAN), PPP, GFP, Frame Relay, ATM (MPOA, MPLS, ...). PHY layer: SDH/SONET. The data link protocols are grouped into legacy protocols and emerging protocols.]
Data-link layer frame processing
• Frame processing involves two key functions:
– frame delineation and
– frame check sequence (FCS)
• The circuit architectures for both functions are determined by
– the protocols and
– the data-path width
• Numerous Frame Delineation and FCS architectures for PPP, ATM and GFP investigated.
– Scalability
– Throughput
– Hardware costs
• Programmable frame processing architecture is desirable to support a variety of protocols
Data-Link Layer
Protocol processing
Frame Delineation
PPP 32-bit ACCM Transmitter Circuit
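[Figure: block diagram of the transmitter. A 32-bit input is checked by four XOR stages and a "= 7D or 7E" comparator, then widened to 64 bits. A data reorganiser with a control unit manages an input buffer, a feedback buffer and an output buffer (with input buffer, feedback buffer and read flags). An 8-bit 7E flag is inserted, and data and status from the CRC generator pass to the output FIFO.]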
Includes Asynchronous-Control-Character-Map (ACCM) function.
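The byte-level operation behind the ACCM circuit can be sketched in software. This is a minimal sketch assuming the octet-stuffing rules of PPP in HDLC-like framing (flag 0x7E, escape 0x7D, escaped octet XORed with 0x20, a 32-bit ACCM selecting which control characters below 0x20 are escaped). The function names `ppp_stuff` and `ppp_unstuff` are illustrative, and the FCS is deliberately left out.

```python
FLAG, ESC, XOR_MASK = 0x7E, 0x7D, 0x20


def ppp_stuff(payload: bytes, accm: int = 0xFFFFFFFF) -> bytes:
    """Transmit side: escape flag/escape octets and ACCM-selected control characters."""
    out = bytearray([FLAG])
    for b in payload:
        if b in (FLAG, ESC) or (b < 0x20 and (accm >> b) & 1):
            out += bytes([ESC, b ^ XOR_MASK])  # one octet becomes two
        else:
            out.append(b)
    out.append(FLAG)
    return bytes(out)


def ppp_unstuff(frame: bytes) -> bytes:
    """Receive side: strip the delimiting flags and undo the escaping."""
    if len(frame) < 2 or frame[0] != FLAG or frame[-1] != FLAG:
        raise ValueError("frame must start and end with the 0x7E flag")
    out = bytearray()
    escaped = False
    for b in frame[1:-1]:
        if escaped:
            out.append(b ^ XOR_MASK)
            escaped = False
        elif b == ESC:
            escaped = True
        else:
            out.append(b)
    return bytes(out)


payload = bytes([0x01, 0x7E, 0x41, 0x7D])
print(ppp_stuff(payload).hex(" "))          # 7e 7d 21 7d 5e 41 7d 5d 7e
print(ppp_stuff(payload, accm=0).hex(" "))  # 7e 01 7d 5e 41 7d 5d 7e
```

Because any octet may expand to two, a word-wide data path cannot know in advance how many output octets one input word produces. That is why the wider circuits need byte comparators on every lane plus a data reorganiser.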
PPP Frame Delineation Circuit
Post-layout synthesis, Altera Stratix II

PPP ACCM Transmitter
Data-path width   ALUTs   Registers   ALMs   Speed (MHz)   Data throughput (Mbps)
8-Bit                18          14     10        251.09                  2008.72
16-Bit              123          87     72        273.29                  4372.78
32-Bit              368         191    207        258.59                  8274.88
64-Bit             1463         466    873        173.16                 11082.24
128-Bit           15231        1129   9592         44.69                  5720.32

PPP ACCM Receiver
Data-path width   ALUTs   Registers   ALMs   Speed (MHz)   Data throughput (Mbps)
8-Bit                19           6     16        256.21                  2049.68
16-Bit               84          62     46        268.9                   4302.4
32-Bit              363         165    196        263.92                  8445.5
64-Bit             1740         338    929        161.23                 10318.72
128-Bit           13985         764   8686         36.43                  4663.04
Hardware complexity grows as O(N^2) with the data-path width N
8 bit and 32 bit Data Paths
32 bit data path requires additional hardware for rearrangement of data words before and after transmission.
Scaling involves a significant area penalty.
Complex data reorganization circuits designed to overcome the limitations set by an octet based protocol
Logic cost increases by factors of 15 and 26 for the ACCM receiver and the ACCM transmitter circuits respectively.
The majority of the logic increase is due to the number of byte comparators, the extra routing and the conditional multiplexers.
• ATM Frame – 5 byte header, 48 byte payload
• Based on Cyclic Coding
• Header Error Check HEC
– Cyclic Redundancy Check (CRC)
– Provides header error detection and frame delineation
– 5th header byte (HEC) calculated from CRC computation of 1st 4 header bytes
– CRC polynomial G(x) = x^8 + x^2 + x + 1
ATM Frame Delineation
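[Figure: ATM header layout, 8 bits wide. Byte 1: GFC (UNI) / VPI, VPI. Byte 2: VPI, VCI. Byte 3: VCI. Byte 4: VCI, PT, CLP. Byte 5: HEC.]
ATM header CRC computation: G(x) = x^8 + x^2 + x + 1 over the first 4 header bytes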
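The HEC computation and a simplified HUNT search can be sketched as follows. Two points are assumptions that do not appear on the slides: the remainder is XORed with 0x55 (the coset defined in ITU-T I.432), and the search is byte-aligned, whereas the circuit described here hunts bit by bit. The names `crc8`, `atm_hec` and `hunt` are illustrative.

```python
ATM_POLY = 0x07  # G(x) = x^8 + x^2 + x + 1, with the x^8 term implicit
COSET = 0x55     # assumption: I.432 adds 01010101 to the CRC remainder


def crc8(data: bytes, poly: int = ATM_POLY) -> int:
    """Bitwise CRC-8, MSB first, zero initial value."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc


def atm_hec(header4: bytes) -> int:
    """HEC byte for the first four header bytes."""
    return crc8(header4) ^ COSET


def hunt(stream: bytes) -> int:
    """Return the first byte offset whose 5-byte window has a valid HEC, or -1."""
    for i in range(len(stream) - 4):
        if atm_hec(stream[i:i + 4]) == stream[i + 4]:
            return i
    return -1


print(hex(atm_hec(bytes([0x00, 0x00, 0x00, 0x01]))))       # 0x52
print(hunt(bytes([0x00, 0x00, 0x00, 0x00, 0x01, 0x52])))   # 1
```

A real delineator would confirm the candidate boundary over several consecutive cells before declaring synchronisation; the sketch stops at the first match.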
ATM Bit-by-Bit HEC HUNT 4-Bit Data-Path Architecture
[Figure: 4-bit Data In feeds a data buffer pipeline and a chain of CRC calculators (CRC Calc 1, 2, 8, 9, 10 and 16 are labelled) with XOR stages and a CRC reset input. Separate frame boundary checks are drawn for bit 0 and bit 1, each ending in a comparator; 4-bit Data Out. Annotations: 8 cycles of data; compare with next 2 nibbles; enable output if match; reset CRC unit.]
Data-path width   Logic cells   Registers   Clock frequency (MHz)   Data throughput (Mbps)
4-Bit                     402         202                  171.97                   687.88
8-Bit                     955         348                  204.42                  1635.36
16-Bit                   1159         370                  160.41                  2566.56
32-Bit                   1274         386                  127.94                  4098.08
64-Bit                   2856         706                  106.39                  6808.96
4, 8, 16, 32 and 64 bit implementations
Altera Stratix Technology
16 bit design - 2.5 Gbps, supports SONET OC48 line rate
64 bit design - 6.8 Gbps
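The throughput figures follow from data-path width multiplied by clock frequency, and the OC-48 claim can be checked the same way. A minimal sketch: the clock values are copied from the ATM results above, the OC-48 rate of 48 x 51.84 Mbps is the standard SONET figure, and the helper names are illustrative.

```python
OC48_MBPS = 48 * 51.84  # SONET OC-48 line rate, 2488.32 Mbps

# Clock frequency (MHz) per data-path width (bits), from the ATM HEC HUNT results.
ATM_CLOCK_MHZ = {4: 171.97, 8: 204.42, 16: 160.41, 32: 127.94, 64: 106.39}


def throughput_mbps(width_bits: int, clock_mhz: float) -> float:
    """One data word is consumed per clock cycle."""
    return width_bits * clock_mhz


def narrowest_width_for(line_rate_mbps: float) -> int:
    """Smallest data-path width whose throughput meets the line rate."""
    ok = [w for w, clk in ATM_CLOCK_MHZ.items()
          if throughput_mbps(w, clk) >= line_rate_mbps]
    return min(ok)


print(f"16-bit: {throughput_mbps(16, ATM_CLOCK_MHZ[16]):.2f} Mbps")
print(f"narrowest width for OC-48: {narrowest_width_for(OC48_MBPS)} bits")
```

Widening the data path lowers the achievable clock, so throughput grows more slowly than the width does.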
Generic Frame Procedure (GFP)
• The Generic Frame Procedure is a Layer-2 framing protocol for data over high-capacity optical networks.
• Recently standardised (ITU-T G.7041) to replace ATM and PPP in high capacity Wide Area Networks (WANs)
• GFP is scalable, allowing the implementation of wide data-path architectures.
• GFP uses a CRC-based frame delineation technique similar to the ATM HEC HUNT and synchronisation technique
GFP Frame Structure
[Figure: GFP frame layout, with transmit bit order across and transmit byte order down.
Core header: PLI, cHEC.
Payload area: payload header, payload, optional pFCS (pFCS [31:24], pFCS [23:16], pFCS [15:8], pFCS [7:0]).
Payload header: payload type MSB (PTI, PFI, EXI), payload type LSB (UPI), tHEC MSB, tHEC LSB, optional extension header (0 – 60 bytes).
Example, linear extension header: CID, Spare, eHEC MSB, eHEC LSB.]
16-bit GFP Core Header Error Check (CHEC) field is used for frame delineation
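The cHEC check and single-bit correction can be sketched with a syndrome look-up table. This is a sketch under stated assumptions, not the circuit from the talk: the generator x^16 + x^12 + x^5 + 1 with a zero initial value is taken from ITU-T G.7041 rather than from the slides, core header scrambling is omitted, and all names are illustrative.

```python
from typing import Optional

GFP_POLY = 0x1021  # assumption: G(x) = x^16 + x^12 + x^5 + 1, zero initial value


def crc16(data: bytes, poly: int = GFP_POLY) -> int:
    """Bitwise CRC-16, MSB first."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc


def make_core_header(pli: int) -> bytes:
    """2-byte PLI followed by its 2-byte cHEC (unscrambled)."""
    body = pli.to_bytes(2, "big")
    return body + crc16(body).to_bytes(2, "big")


def syndrome(header: bytes) -> int:
    """Zero for an error-free core header."""
    return crc16(header[:2]) ^ int.from_bytes(header[2:4], "big")


# One fixed 16-bit constant per bit position of the 32-bit core header.
SYNDROME_TO_BIT = {syndrome((1 << bit).to_bytes(4, "big")): bit for bit in range(32)}


def check_and_correct(header: bytes) -> Optional[bytes]:
    """Return the (possibly corrected) header, or None if it is not correctable."""
    s = syndrome(header)
    if s == 0:
        return header
    bit = SYNDROME_TO_BIT.get(s)
    if bit is None:
        return None
    return (int.from_bytes(header, "big") ^ (1 << bit)).to_bytes(4, "big")


good = make_core_header(100)
damaged = (int.from_bytes(good, "big") ^ (1 << 19)).to_bytes(4, "big")
print(check_and_correct(damaged) == good)
```

In this sketch the table holds 32 fixed 16-bit constants, one per header bit. A hunting delineator would apply the zero-syndrome test at every candidate byte offset and enable correction only once synchronised.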
GFP Frame Delineation 64-bit Datapath with 1-bit Header Error Correction Circuit
[Figure: 64-bit Data In passes through Data Buffers 1 to 3 and an 8-bit latch into an 88-bit in, 64-bit out MUX that drives the 64-bit Data Out. An 8-byte window gate presents the byte lanes (bits 0-7 through bits 56-63) to eight CRC calculators (CRC Calc 1 to 8), each followed by a 16-bit comparator. The comparator results feed a frame synchronisation state machine with enable and control outputs, a payload counter loaded from the PLI field (bits 16-31), and a single-bit error correction look-up circuit.]
Preliminary design study, Altera Stratix II-3 FPGA technology:
Max clock: 165 MHz
ALUTs: 1107
Registers: 653
ALMs: 751
LABs: 149
Throughput: 10.5 Gbps

UMC 130 nm reference design (Cadence Encounter):
Clock frequency: 250 MHz
Total area: 0.12 mm^2
Throughput: 16.0 Gbps
Total power: 1.6 x 10^-2 W
Internal power: 1.4 x 10^-2 W
Switching power: 2.3 x 10^-3 W
Leakage power: 8.1 x 10^-5 W
Fastest implementation in the literature
Programmable Frame Delineation
[Figure: ATM, GFP, PPP and Ethernet frame delineation circuits placed in parallel on a shared 32-bit Data In / Data Out path, with CLK, Data Status, Protocol Select and Protocol Select Enable signals.]
Programmable data-path PPP/GFP/Ethernet/ATM frame delimiter
Data path   ALUTs   ALMs   LABs   Registers   Clock freq (MHz)   Data throughput (Mbps)
32-Bit       1515   1099    209         768              153.9                     4926
64-Bit       4183   2812    490        1523               99.9                     6394
Altera Stratix II-3 FPGA Technology
The 10 Gbps target is not achievable in FPGA technology; it should be achievable with an ASIC
32-Bit Protocol Processing Circuit Decomposition
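PPP (RxD)
– Buffer registers: 2 4-byte registers
– Muxes: 2 4-byte muxes
– CRC (XOR gates): 0
– Comparators: 8 8-bit comparators with constants
– Protocol state machine: 0
– Error correction look-up circuit: 0
– Counter: 0

PPP (TxD)
– Buffer registers: 2 8-byte registers
– Muxes: 2 8-byte muxes
– CRC (XOR gates): 0
– Comparators: 8 8-bit comparators with constants
– Protocol state machine: 0
– Error correction look-up circuit: 0
– Counter: 0

GFP
– Buffer registers: 2 4-byte registers
– Muxes: 1 7-byte in, 4-byte out
– CRC (XOR gates): 4 16-bit in/16-bit out CRC matrices; 352 XOR gates, 88 in each 16*16 matrix
– Comparators: 4 16-bit comparators
– Protocol state machine: tri-state synchronisation state machine
– Error correction look-up circuit: 32 16-bit comparators with fixed constants
– Counter: PLI counter

ATM (byte-by-byte)
– Buffer registers: 2 4-byte registers
– Muxes: 1 7-byte in, 4-byte out
– CRC (XOR gates): 4 32-bit in/8-bit out CRC matrices; 440 XOR gates, 26 in each 8x8 matrix (*4*4) + 3*8 XOR
– Comparators: 4 8-bit comparators
– Protocol state machine: tri-state synchronisation state machine
– Error correction look-up circuit: 0
– Counter: 48-byte counter

Ethernet
– Buffer registers: 2 4-byte registers
– Muxes: 1 7-byte in, 4-byte out
– CRC (XOR gates): 0
– Comparators: 8 8-bit comparators with constants
– Protocol state machine: 0
– Error correction look-up circuit: 0
– Counter: 3-bit counter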
Programmable ATM/GFP Protocol Frame Delineation Architecture
[Figure: two versions of the programmable ATM/GFP frame delineation architecture.
Shared common elements: 32-bit Data In enters Data Buffer 1 and a 4-byte window that feeds the GFP HEC calculation (CRC Calc 1 to 4) and the ATM HEC calculation (CRC Calc 1 to 4), four comparators, a control unit (dual state machine), a GFP single-bit error correction look-up circuit, a payload counter with enable and control, and a 7-byte in, 4-byte out data-path mux driving the 32-bit Data Out and Data Status. Protocol Select sets the CRC engine enable, configures the comparators as 8-bit or 16-bit, and configures the counter (PLI field or 48 bytes).
Separate data path: an ATM frame delineation circuit and a GFP frame delineation circuit in parallel, sharing 32-bit Data In, 32-bit Data Out, Data Status, CLK, Protocol Select and Protocol Select Enable.]
Programmable ATM/GFP frame delimiter
Architecture         ALUTs   Registers   ALMs   LABs   Clock frequency (MHz)   Data throughput (Mbps)
Common elements        885         387    530     78                  165.65                   5300.8
Separate data-path     872         531    621    124                  159.46                  5102.72
Architectural Studies – Frame Delineation Architectures
Header Error Check architectures easier to scale than pattern based architectures.
Examined the feasibility of deriving a common programmable frame delineation architecture for layer-2 protocols (PPP, ATM, GFP) operating at 2.5 Gbps or more.
Unable to derive one because of the diverse nature of the techniques used.
Implementations of the GFP and ATM frame delineation techniques, which are based on a similar header error check method, showed significant diversity, which restricts a common architecture.
However, a programmable architecture that is slightly faster and smaller can be derived, and it is highly suitable for standard-cell-based implementation. The key aspect is a 50% reduction in registers, a key cost.
Frame Delineation Architectures - Conclusions
Options
Multiplexed specific-purpose circuits
FPGA that can be reconfigured to implement a specific protocol
The first is the more efficient implementation (area and speed) for 10 or fewer protocols
Derivation of a programmable datapath based on common low level functional elements is a potential low hardware cost option
Data-Link Layer
Protocol processing
Frame Check Sequence
Frame Check Sequence Circuits
• Data integrity is paramount for data-link layer protocols
• Cyclic Redundancy Check (CRC) is the preferred method for detecting bit and burst errors in protocol payloads caused by medium-related noise.
• Commonly used CRC types for layer-2 protocols
CRC type              Application field
CRC-4, CRC-5, CRC-6   Frame alignment for terminal equipment
CRC-8                 ATM cell header error control
CRC-10                ATM Adaptation Layer type 3 and 4
CRC-16                HDLC, Wide Area Network (WAN)
CRC-32                AAL-5, HDLC, PPP, GFP, Ethernet
Investigated Architectures
• Hardwired parallel CRC circuits for a given port size and generator polynomial G(x).
• Semi-reconfigurable parallel CRC circuit with reconfigurable input port size and a given generator polynomial G(x).
• Fully reconfigurable CRC computation circuit for any generator polynomial G(x) of degree up to 32 and port sizes of 4, 8, 16, 24, and 32 bits.
Parallel Hardwired CRC-8 Circuit
[Figure: parallel hardwired CRC-8 circuit. The current CRC bits X7 to X0 and the input data bits D7 to D0 are combined bit by bit with XOR gates (stage 1), and a fixed second network of XOR gates (stage 2) produces the next CRC bits X'7 to X'0.]
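The XOR network of a hardwired parallel CRC can be derived mechanically from the bit-serial definition, because the CRC update is linear over GF(2). The sketch below probes the serial update with one set bit at a time to find which inputs feed each output bit. It is an illustration of the general technique, not the authors' design flow, and the names `serial_step`, `derive_equations` and `parallel_step` are assumptions.

```python
from typing import List, Tuple

Term = Tuple[str, int]  # ("X", j) = current CRC bit j, ("D", j) = input data bit j


def serial_step(crc: int, data: int, width: int, poly: int, nbits: int) -> int:
    """Bit-serial reference: shift `nbits` of `data` into the CRC, MSB first."""
    top, mask = 1 << (width - 1), (1 << width) - 1
    for i in reversed(range(nbits)):
        feedback = bool(crc & top) ^ bool((data >> i) & 1)
        crc = (crc << 1) & mask
        if feedback:
            crc ^= poly
    return crc


def derive_equations(width: int, poly: int, nbits: int) -> List[List[Term]]:
    """eqs[k] lists the terms XORed together to form next-state bit X'k."""
    eqs: List[List[Term]] = [[] for _ in range(width)]
    probes = [("X", j, 1 << j, 0) for j in range(width)]
    probes += [("D", j, 0, 1 << j) for j in range(nbits)]
    for name, j, crc, data in probes:
        out = serial_step(crc, data, width, poly, nbits)  # response to one set bit
        for k in range(width):
            if (out >> k) & 1:
                eqs[k].append((name, j))
    return eqs


def parallel_step(crc: int, data: int, eqs: List[List[Term]]) -> int:
    """Evaluate every XOR equation at once, as a hardwired circuit does in one clock."""
    nxt = 0
    for k, terms in enumerate(eqs):
        bit = 0
        for name, j in terms:
            bit ^= ((crc if name == "X" else data) >> j) & 1
        nxt |= bit << k
    return nxt


CRC8_EQS = derive_equations(width=8, poly=0x07, nbits=8)  # G(x) = x^8 + x^2 + x + 1
for k in reversed(range(8)):
    print(f"X'{k} =", " ^ ".join(f"{n}{j}" for n, j in CRC8_EQS[k]))
```

Changing `poly`, `width` or `nbits` yields a different fixed XOR network, so each hardwired circuit serves exactly one polynomial and port size.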
Parallel CRC-32 with Programmable Input Bus
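[Figure: 32-bit input data enters an input data buffer and a byte reorder mechanism (byte lanes bits 0-7, 8-15, 16-23 and 24-31, selected by a 2-bit Data Bytes Select) that feeds the CRC computational matrix; the 32-bit CRC output is held in an output data buffer. A transmit example shows a frame made of flag (F), address (A), control (C), protocol (P1, P2), numbered payload bytes and four CRC bytes laid out across the bus, with the start of frame marked and the next frame following directly.]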
A programmable input bus is required if the frame size is not a multiple of the port size, or if the frame data is not aligned to the least significant byte (LSB) of the input bus, as illustrated in the accompanying figure
Input bus configuration
Requires feedback circuit reconfiguration
ATM HEC, G(x) = x^8 + x^2 + x + 1
Port size   ALUTs   ALMs   LABs   Registers   Clock freq (MHz)   Data throughput (Mbps)
8-Bit          16      8      1           8              240.3                   1922.4
16-Bit         54     48     10           8              240.9                   3854.4
32-Bit        126     87     14           8              238.6                   7635.2
64-Bit        288    179     28           8              170.7                  10924.8

HDLC, G(x) = x^16 + x^15 + x^2 + 1
Port size   ALUTs   ALMs   LABs   Registers   Clock freq (MHz)   Data throughput (Mbps)
8-Bit          37     35      6          16              237.6                   1900.8
16-Bit         54     36      6          16              238.4                   3814.4
32-Bit        110     81     16          16              237.5                   7600
64-Bit        319    205     34          16              194.25                 12432

Ethernet, G(x) = x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
Port size   ALUTs   ALMs   LABs   Registers   Clock freq (MHz)   Data throughput (Mbps)
8-Bit         117     87     15          32              237.5                   1900
16-Bit        167    127     23          32              237.5                   3800
32-Bit        257    196     35          32              233                     7456
64-Bit        662    474     85          32              172.6                  11046.4
Dedicated Parallel CRC architectures
These values will be compared with the fully reconfigurable architecture
Altera Stratix II-3 FPGA Technology
Programmable CRC with Partial Programmability (uses multiplexing)
[Figure: a 64-bit Data In bus feeds four CRC calculators in parallel: 64-bit (2 CRC 32x32 matrices), 32-bit (1 CRC 32x32 matrix), 16-bit (1 CRC 32x16 matrix) and 8-bit (1 CRC 32x8 matrix). A Port Size Configure input and the data-path configuration select which 32-bit result drives the CRC output.]
Programmable data-path CRC-32
Port size   ALUTs   ALMs   LABs   Registers   Clock freq (MHz)   Data throughput (Mbps)
32-Bit        350    246     40          32              156                       4992
64-Bit        872    520     83          32               92.08                    5893
Altera Stratix II-3 FPGA Technology
Parallel and Fully Reconfigurable CRC Computation Circuit for High Speed Data Processing
[Figure: a 32x32 CRC cell array (rows 0 to 31, columns 0 to 31). Input data bits D0 to D31 enter one per column, with '0' substituted on unused columns under Port Size Configure. The current CRC value (CRC0 to CRC31) is fed back, with the previous CRC data routed according to the Port Size and CRC Size parameters (CRC Size Configure). A "P Interface" supplies the CRC size, port size and polynomial to a Generate Matrix block, which computes the matrix column by column. A counter with reset drives a one-hot column configuration enable, and the configuration data is broadcast to every column.]
Patent Pending
Altera Stratix II-3 FPGA technology:
Max clock: 114.98 MHz
ALUTs: 2240
Registers: 1365
ALMs: 1620
LABs: 292
Supports throughput rates above 2.5 Gbps (~3.68 Gbps)

UMC 130 nm reference design (Cadence Encounter):
Clock frequency: 125 MHz
Total area: 0.27 mm^2
Throughput: 8.0 Gbps
Total power: 5.9 x 10^-3 W
Internal power: 4.2 x 10^-3 W
Switching power: 1.6 x 10^-3 W
Leakage power: 1.2 x 10^-4 W
Performance Evaluation
• There is a trade-off cost for programmability
– Fully reconfigurable CRC is 8x larger and 2x slower than the hard-wired CRC-32
– For CRC’s with small polynomials and input bus sizes, the area cost difference can be a factor 100
• Hardware efficient programmability for parallel FEC circuits can be achieved by multiplexing between different custom implementations
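The quoted cost of programmability can be reproduced from the FPGA synthesis figures given earlier in the talk. A minimal sketch: the values are copied from the dedicated CRC results (32-bit Ethernet CRC-32: 257 ALUTs at 233 MHz; 8-bit ATM HEC CRC-8: 16 ALUTs) and from the fully reconfigurable circuit (2240 ALUTs at 114.98 MHz). Using ALUT count as the measure of area is an assumption made for this example.

```python
# Altera Stratix II-3 figures quoted in the talk.
HARDWIRED_CRC32_32BIT = {"aluts": 257, "clock_mhz": 233.0}
HARDWIRED_CRC8_8BIT = {"aluts": 16, "clock_mhz": 240.3}
RECONFIGURABLE = {"aluts": 2240, "clock_mhz": 114.98}

area_ratio = RECONFIGURABLE["aluts"] / HARDWIRED_CRC32_32BIT["aluts"]
speed_ratio = HARDWIRED_CRC32_32BIT["clock_mhz"] / RECONFIGURABLE["clock_mhz"]
small_crc_area_ratio = RECONFIGURABLE["aluts"] / HARDWIRED_CRC8_8BIT["aluts"]

print(f"vs hard-wired CRC-32: {area_ratio:.1f}x larger, {speed_ratio:.2f}x slower")
print(f"vs hard-wired CRC-8 (8-bit port): {small_crc_area_ratio:.0f}x larger")
```

On these figures the reconfigurable circuit is close to 9x larger and 2x slower than the hard-wired 32-bit CRC-32, and over 100x larger than the smallest hard-wired CRC-8, in line with the factors stated above.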
Performance Evaluation
• If polynomial G(x) is not known, then a fully programmable implementation is an appropriate solution
• Other applications include storage where CRC computation can be 30% of overall
Programmable Packet Scheduling
Programmable IP packet scheduling
Programmability of Internet Protocol packet scheduling is an essential feature for dealing with:
• Complex traffic profiles and heterogeneity of service
• Efficient bandwidth resource utilisation
• To provide on-demand and customised QoS support
Current Internet QoS Problems
• Internet routers support best effort packet delivery only (Best Effort Service)
• Delay guarantee for delay sensitive services cannot be provided
• Real time and interactive services (Voice, Video) will not meet users’ expectation of quality
[Figure: a router carrying Telnet, Web (HTML), FTP, VoIP and VoD traffic.]
How we can provide QoS in Internet
• Single lane model: the current Internet, best effort service
• Multiple lane model (“the motorway”): a proposed and partially deployed method, Differentiated Services (“DiffServ”)
• Resource reservation model (railway / aircraft): a proposed method, Integrated Services (“IntServ”)
Packet scheduling is paramount for QoS
• A packet scheduler decides when to send each packet based on
– Traffic type ID (Tag, DiffServ)
– Flow ID (Source Destination Address, IntServ)
• The scheduling algorithm and the deployed service policy determine the QoS performance of the network
• Scheduling method trade-off: computational complexity against desired fairness
[Figure: router / switch output control. A packet classifier sorts packets into per-flow queues (Flow 1, Flow 2, ... Flow N), which are served by a scheduler.]
Switch adaptation via partial reconfiguration
[Figure: cell input passes through an input processor into a reconfigurable FIFO block (configurable number of FIFOs and FIFO size) and out through an output processor to cell output, over 32-bit paths. Four FIFO control units (FIFO Control 0 to 3) and a buffer access control sit behind a memory control block, which connects through a data latch to SRAM over a 32-bit data bus and a 19-bit address bus (CS, WR). Control signals: WR Queue ID, RD Queue ID, Queue WR, Queue RD, Queue Full, Queue Idle.]
Adaptation achieved by partially reconfiguring FPGA by adding or removing (i.e. reconfiguring) packet FIFO circuits and output packet schedulers
Partial Reconfiguration
Issues relating to run-time, gate level configurable logic
• Limited memory resources, off chip memory access a bottleneck
• Reconfiguration interrupts traffic flow - QoS degradation.
• Partial reconfiguration is limited to similar scheduling policies.
• Runtime and partial configuration adds additional complexity
• Partial reconfiguration control remains an unsolved challenge despite a promising model.
• Current FPGA technology immature in terms of design tools and run-time reconfiguration support
Systems Level Approach for Programmable Packet Scheduling
[Figure: blocks between scheduler input and scheduler output. Packet classification (traffic flow/class, QoS), finishing tag computation, tag lookup table write control, finishing tag lookup tables (fields FT, AP, R), tag lookup table read control and packet server. Alongside these, a shared buffer write control stores packets in an external shared packet buffer, and packet location pointers link table entries to the buffer.]
Scheduling policy determines the computation and the use of parameters
Patent Pending
Packet handling isolated from schedule policy functions
Systems Level Approach for Programmable Packet Scheduling
• Packet handling functions isolated from scheduler policy specific functions.
• Individual queues replaced by a more complex shared buffer architecture that can support multiple queues
– controlled via address pointers and linked lists.
• Provides clear separation of the packet scheduling architecture into
– a circuit purely responsible for dealing with packet service policies, i.e. scheduling algorithms, and
– a circuit concerned with packet handling, e.g. store/retrieve
• Allows flexible programmability of the scheduling policy and number of packet queues without reconfiguring the hardware
• Comparable throughput rates to implementations with physically built queues
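The separation described above can be sketched in software. This is a minimal illustration under stated assumptions, not the patented architecture: packets live in one shared buffer whose free slots are chained by address pointers, the scheduler keeps only (finishing tag, pointer) entries, and the policy is a plain function that can be swapped at run time. All class and function names are hypothetical, and the weighted policy is a simplified fair-queuing rule chosen for the example.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple


class SharedPacketBuffer:
    """Packet handling only: one store, free slots linked by address pointers."""

    def __init__(self, slots: int) -> None:
        self.data: List[Optional[tuple]] = [None] * slots
        self.next_free = list(range(1, slots)) + [-1]  # slot i points to slot i+1
        self.free_head = 0 if slots else -1

    def store(self, packet: tuple) -> int:
        if self.free_head == -1:
            raise MemoryError("shared buffer full")
        addr = self.free_head
        self.free_head = self.next_free[addr]
        self.data[addr] = packet
        return addr

    def retrieve(self, addr: int) -> tuple:
        packet = self.data[addr]
        self.data[addr] = None
        self.next_free[addr] = self.free_head  # return the slot to the free list
        self.free_head = addr
        return packet


Policy = Callable[["Scheduler", str, int], float]


class Scheduler:
    """Policy-specific part: computes and orders finishing tags, never moves data."""

    def __init__(self, buffer: SharedPacketBuffer, policy: Policy) -> None:
        self.buffer = buffer
        self.policy = policy  # may be replaced at run time
        self.tag_table: List[Tuple[float, int, int]] = []  # (tag, arrival no., pointer)
        self.last_finish: Dict[str, float] = {}
        self.virtual_time = 0.0
        self.arrivals = 0

    def enqueue(self, flow: str, length: int, payload: str) -> None:
        addr = self.buffer.store((flow, payload))
        tag = self.policy(self, flow, length)
        self.last_finish[flow] = tag
        heapq.heappush(self.tag_table, (tag, self.arrivals, addr))
        self.arrivals += 1

    def dequeue(self) -> Optional[tuple]:
        if not self.tag_table:
            return None
        tag, _, addr = heapq.heappop(self.tag_table)  # smallest finishing tag first
        self.virtual_time = tag
        return self.buffer.retrieve(addr)


def fifo_policy(s: Scheduler, flow: str, length: int) -> float:
    return float(s.arrivals)


def weighted_fair_policy(weights: Dict[str, float]) -> Policy:
    def policy(s: Scheduler, flow: str, length: int) -> float:
        start = max(s.virtual_time, s.last_finish.get(flow, 0.0))
        return start + length / weights[flow]
    return policy


sched = Scheduler(SharedPacketBuffer(4), weighted_fair_policy({"voice": 4.0, "bulk": 1.0}))
for flow, payload in [("bulk", "b1"), ("bulk", "b2"), ("voice", "v1"), ("voice", "v2")]:
    sched.enqueue(flow, 100, payload)
print([sched.dequeue() for _ in range(4)])
```

Assigning `sched.policy = fifo_policy` changes the service order for later packets without touching the buffer, which is the software analogue of reprogramming the policy without reconfiguring the hardware.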
UMC 130 nm implementation (Cadence Encounter):
Clock frequency: 143 MHz
Number of IOs: 478 pins
Total area: 14.4 mm^2
Number of sessions: 1,000,000
Number of packets: up to 30 million packets can be supported in external DDR
Throughput: 35.8 M packets/sec
Throughput: ~40 Gbps line rate (assuming a mean IP packet size of 130 bytes)
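The line-rate figure follows from the packet rate and the assumed mean packet size. A short check, using only the numbers quoted above (the variable names are illustrative):

```python
PACKETS_PER_SEC = 35.8e6   # scheduler throughput
MEAN_PACKET_BYTES = 130    # mean IP packet size assumed in the talk
CLOCK_HZ = 143e6           # clock frequency

line_rate_gbps = PACKETS_PER_SEC * MEAN_PACKET_BYTES * 8 / 1e9
cycles_per_packet = CLOCK_HZ / PACKETS_PER_SEC

print(f"equivalent line rate: {line_rate_gbps:.1f} Gbps")
print(f"clock cycles per scheduled packet: {cycles_per_packet:.2f}")
```

This gives roughly 37 Gbps, consistent with the rounded ~40 Gbps figure, and corresponds to about four clock cycles per scheduled packet.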
Configurable Packet Scheduling
[Figure: chip layout with labelled regions: Address Translation Table, Search Trie, Memory.]
90% distributed embedded memory
Patent Pending
Conclusions
Queue and scheduling policy adaptation via address pointer, lookup tables and packet time-stamp processing
Low programming complexity
Service requirements can be translated immediately
No traffic interruption is required
instant change of queues and scheduling policies
Conclusions
Performance comparable with customized implementations
Complex, but affordable data processing hardware
Programmability at the system level NOT reconfigurability at gate level
Programming scheduler does not require place and route at gate level
Programmable
Cryptographic Architectures
Programmable Cryptographic Architectures
• Encryption needs to be performed on data in real-time
– 100 Mbps networks, 1G Ethernet, 10G Ethernet
• This holds the key to successful growth of applications such as WLANs, satellite communications, e-businesses …
• Software architectures are too slow
• Hardware solutions required
Programmable Cryptographic Architectures
• Reconfigurable Cryptographic Architectures can be used to provide the security requirements of many applications
• FPGAs are well suited for crypto algorithms:
– allow algorithm agility
– support alterable architecture parameters, scalable security (DES/3DES)
• Clever mapping of complex math operations onto special purpose silicon architectures
Private Key Algorithms: AES
• NIST requested a new Advanced Encryption Standard (AES) to replace DES - Sept 1997
• Interim measure – TripleDES
• RIJNDAEL : AES Winner - Oct 2000
• Developed by Joan Daemen, Vincent Rijmen
• Replaced DES as Federal Standard in November 2001
• 128-bit Data, 128, 192 or 256-bit Key
Reconfigurable AES Architecture
• In conjunction with AES, NIST recommended 5 modes of operation
– Electronic Codebook (ECB) mode
– Cipher Block Chaining (CBC) mode
– Output Feedback (OFB) mode
– Cipher Feedback (CFB) mode
– Counter (CTR) mode: a simplification of OFB mode
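The structural difference between a feedback mode and counter mode can be sketched as follows. The block function here is a stand-in built from SHA-256, chosen only so the example runs with the standard library: it is NOT AES and NOT secure. The names `toy_block_encrypt`, `ctr_crypt` and `ofb_crypt` are hypothetical.

```python
import hashlib

BLOCK = 16  # AES block size in bytes (128 bits)


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for the AES forward cipher. NOT AES, NOT secure; illustration only."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def ctr_crypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """CTR: keystream block i depends only on the counter, so blocks are independent."""
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        counter_block = nonce + (i // BLOCK).to_bytes(BLOCK - len(nonce), "big")
        out += xor(data[i:i + BLOCK], toy_block_encrypt(key, counter_block))
    return bytes(out)


def ofb_crypt(key: bytes, iv: bytes, data: bytes) -> bytes:
    """OFB: each keystream block is the cipher applied to the previous keystream block."""
    out = bytearray()
    feedback = iv
    for i in range(0, len(data), BLOCK):
        feedback = toy_block_encrypt(key, feedback)  # serial dependency between blocks
        out += xor(data[i:i + BLOCK], feedback)
    return bytes(out)


key, nonce, iv = b"k" * 16, b"n" * 8, b"i" * 16
message = b"reconfigurable architectures for networks"
assert ctr_crypt(key, nonce, ctr_crypt(key, nonce, message)) == message
assert ofb_crypt(key, iv, ofb_crypt(key, iv, message)) == message
```

In both modes the same routine encrypts and decrypts, and only the forward direction of the block cipher is used. In OFB each block must wait for the previous one, whereas CTR blocks can be computed independently.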
Private Key Algorithms: AES
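[Figure: AES data flow. Plaintext and key enter an initial data/key addition, followed by rounds Rnd 0 ... Rnd 8 and a final round that produces the ciphertext; a key schedule supplies the round keys. The round shown consists of ByteSub, ShiftRow, MixCol and Key Addition.]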
Reconfigurable AES Architecture
• Reconfigurable AES architecture with following features
– Iterative architecture
– On-chip key scheduling
– Support for 3 key lengths
– Encryption & Decryption
– Support for feedback modes of operation
Performance Evaluation
Reconfigurable AES vs specific-purpose Enc/Dec
Design                                       Device    Area                     Throughput (Mbps)
AES Encryptor, 128-bit key                   XCV400E   1987 slices, 18 BRAMs    423
AES Decryptor, 128-bit key                   XCV600E   2121 slices, 18 BRAMs    557
AES Enc/Dec, 128-bit key, supports 5 modes   XCV600E   4681 slices, 20 BRAMs    310
Performance Evaluation
• Only 2 additional BRAMs are required in the reconfigurable design, as memory re-use is possible
• Reconfigurable Design
– Throughput reduced up to 40%
– Area increased by 10%
• However, modes of operation supported
=> Area/speed penalty is an acceptable trade-off in favour of using reconfiguration over multiple specific-purpose circuits
Conclusions
Conclusions
• Frame Delineation
– Common architecture could not be found
– Separate FPGA circuit for each or multiplex between separate circuits on an ASIC
– Derivation of a programmable datapath based on common low level functional elements is a potential low hardware cost option
• CRC circuits –well defined (i.e. 8 ) options
– Fully reconfigurable ASIC possible but larger and slower than 8 separate versions
– If G(x) and number of options is not known then use fully programmable solution
Conclusions
• Packet Scheduler
– Systems level approach deploying address pointer, lookup tables and packet time-stamp processing the most appropriate approach
– Enables programmability while supporting line rates beyond 100 Gbps
– Best approach, tackle at the Systems and Architecture level rather than FPGA level
– Current FPGA technology and design tools for run-time reconfiguration too immature for packet scheduling
Conclusions
• Encryption/Decryption
– Re-configurable architecture identified
– Supports a number of modes of operation
• Reconfigurable Design
– Throughput reduced up to 40%
– Area increased by 10%
– Area/speed penalty acceptable trade-off compared with reconfiguration of multiple specific circuits
The Institute of Electronics, Communications and Information Technology
Professor John McCanny CBE FRS FREng
Dr Sakir Sezer ([email protected]),
Dr Maire McLoone (m.mcloone@ecit.qub.ac.uk)