Download - Revisão: Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN

Lendo e apresentando:

Forwarding Metamorphosis:

Fast Programmable Match-

Action Processing in Hardware

for SDN

Bruno Castelucci – 2013.10.31

1

References:

Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN (Pat Bosshart, Glen Gibb, Hun-Seok Kim, George Varghese, Nick McKeown, Martin Izzard, Fernando Mujica, Mark Horowitz) SIGCOMM - 2013.08.13

Reading: Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN (Ji Yang) 2013.09.23

2

Objetivos

• O Papper descreve a implementação de um Switch de 64 ports de 10Gbps usando o modelo RMT (Reconfigurable Match Tables).

• Desenhar uma arquitetura para o RMT.

• Mostrar alguns casos de uso.

3

Exemplo: Switch padrão

4

Dep

ars

er

In

Queues

Data

Out ACL Stage

L3 Stage

L2 Stage

Act

ion

: se

t L

2D

Stage 1

L2

Ta

ble

L2: 128k x 48 Exact match

Act

ion

: se

t L

2D

, d

ec T

TL

Stage 2

L3

Ta

ble

L3: 16k x 32 Longest prefix match

Act

ion

: p

erm

it/d

eny

Stage 3 A

CL

Ta

ble

ACL: 4k Ternary match

Pa

rser

X X X X X

Access Control List (ACL) Time to live (TTL)

Forwarding Plane is Challenged

• Current Hardware switches are rigid

▫ Match Action can process limited set fields

• OpenFlow/SDN protocol defines a limited repertoire of packet processing actions ▫ Updated frequently

• Hardware still has its vantages ▫ Speed: Faster than CPU ▫ Pipeline, parallelism

What if you need flexibility?

• Flexibility to: ▫ Trade one memory size for another ▫ Add a new table ▫ Add a new header field ▫ Add a different action

• SDN accentuates the need for flexibility ▫ Gives programmatic control to control plane, expects

to be able to use flexibility Multiple stages of match-action Flexible actions Flexible header fields

6

What about Alternatives?

Aren’t there other ways to get

flexibility?

• Software? 100x too slow, expensive

• NPUs? 10x too slow, expensive

• FPGAs? 10x too slow, expensive

7

What’s Hard about a

Flexible Switch Chip?

• Big chip

• Wiring intensive

• Many crossbars

• Lots of TCAM

• Operate in high frequency – 1TBps

8

The RMT Model

9

Hardware Architectures for SDN

• Single Match Table ▫ Easier to implement. ▫ Needs to store every combination of headers: too big.

• Multiple Match Tables ▫ Allows multiple smaller match tables to be matched by a subset of packet fields. ▫ Number of and size of are limited: hard to optimize and add new behaviors ▫ Very few actions implemented today on current chips.

• Reconfigurable Match Tables ▫ Work under pipeline ▫ New fields added easily ▫ Number, topology, widths, depths of match tables can be configured ▫ New actions defined easily ▫ Packets can be placed in specified queues (queuing discipline specified)

• RMT permite que o plano de encaminhamento seja alterado "on the run" sem modificar o hardware.

Como no Openflow com MMT, o programador pode especificar match tables, limitado somente ao tamanho do hardware, porém, o RMT permite modificar todos os header fields do pacote com mais flexibilidade.

RMT Consists of:

11

•Reconfigurable parser •Support field definition

•Resizable match •VLIW: very long instruction word

•Configurable output queue

How is it configured?

RMT Parse Graph and Table Graph

12

Compiler and control protocol not available yet.

But the Parse Graph and Table

Graph don’t show you how to build

a switch

13

Performance vs Flexibility

• Multiprocessor: memory bottleneck • Change to pipeline • Fixed function chips specialize processors • Flexible switch needs general purpose CPUs

14

Memory

Memory

Memory CPU

CPU

CPU

L2 L3 ACL

Access Control List (ACL)

Memory

C P

U

C

PU

CPU

C P

U

C

PU

CPU

C

PU

CPU

CP

U

How We Did It • Memory to CPU bottleneck

• Replicate CPUs

• More stages for finer granularity

• Higher CPU cost ok

15

C P

U

C

PU

CPU

Antes, definições de memória

• RAM –

▫ binário (0, 1), busca em loop por endereços, ou hash table, busca lenta.

• CAM – Content Addressable Memory –

▫ binário, associative memory, o conteudo é buscado em todos os campos ao mesmo tempo, busca rapida.

• TCAM – Ternary CAM –

▫ ternária (0, 1, X), wildcard match, busca rápida. Prefix matching.

16

RMT Consists of:

17

Parser

18

•Recebe o pacote de produz o header vector de 4kb.

•O parse é feito baseado num gráfico provido pelo usuário convertido em entradas numa TCAM de 256 x 40b entradas. •Cada interação tem um tamanho de campo de header específico e gera um campo no vetor de saida, o parser volta ao inicio para tratar o próximo campo. •O processo de parse é feito sequencialmente até que todo o header seja trabalhado, e todo o vector header esteja pronto.

32x Match

Stages

• Cada estágio fisico tem memórias SRAM e TCAM e Action CPUs associadas independentes. ▫ Cada physical match stage tem

subunidades de SRAM (8x 80b) 640b e de TCAM (16x 40b) 640b.

• Cada subunidade pode trabalhar: ▫ independentemente, ▫ em grupos para maiores campos

maiores, ▫ ou em conjuntos para tabelas mais

extensas.

• Cada match stage tem adicionais 106 RAM Blocks de 1k x 112b e 16 blocos de TCAM de 2k x 40b. ▫ A fração configurada para memórias de

match, action e estatistica é configurável. ▫ Todas as leituras são realizadas em

paralelo.

• Em cada entrada de match table na RAM está associado um ponteiro para a action memory e size, instruction memory e um next table address.

19

TCAM

640b

640b

Physical Stage n

Physical Stage 2

Logical Table 1

Ethertype

Logical Table 6 L2D

8 UDP

2 VLA

N

3 IPV4

5 IPV6

4 L2S

7 TCP SRAM HASH

Physical Stage 1

RMT Logical to Physical Table Mapping

20

ACL

UDP TCP

L2S

L2D

IPV4

ETH

VLAN

IPV6

9 ACL

Table Graph

Act

ion

Ma

tch

Ta

ble

Act

ion

Ma

tch

Ta

ble

Act

ion

Ma

tch

Ta

ble

Access Control List (ACL)

Instruction

ALU

Match result

Action Processing Model

21

Hea

der

In

F

ield

F

ield

Data

Hea

de

r O

ut

ALU

VLIW Instructions Match result

Modeled as Multiple VLIW CPUs per Stage

ALU ALU ALU ALU ALU ALU ALU ALU

22

VLIW Instructions

23

Reducing Latency

• Dependency Analysis

Hardware Implementation • 64 x 10Gb ports

▫ 960M packets/second

▫ 1GHz pipeline

• Packet header Parser

▫ 4Kb vector

▫ 256*40b TCAM,

• Physical Stages N=32

▫ 106 blocks SRAM with 1K*112bits: for 80bits width Hash

▫ 16 blocks TCAM with 2K*40bits

▫ Total:370Mb SRAM, 40Mb TCAM, support combine in depth or width

• Action units

– 200+ for each stage

– Each for one field

– 7k+ units total

• Queues

– 2K queues per port

– 4 levels of hierarchy

– Allowing various combinations of deficit round robin, hierarchical fair queuing, token buckets, priorities.

Cost of Configurability:

Comparison with Conventional Switch

• Many functions identical: I/O, data buffer, queueing…

• Make extra functions optional: statistics

• Memory dominates area ▫ Compare memory area/bit and bit count

• RMT must use memory bits efficiently to compete on cost

• Techniques for flexibility

26

Chip Comparison with Fixed Function Switches

Section Area % of chip Extra Cost

IO, buffer, queue, CPU, etc 37% 0.0%

Match memory & logic 54.3% 8.0%

VLIW action engine 7.4% 5.5%

Parser + deparser 1.3% 0.7%

Total extra area cost 14.2%

27

Section Power % of chip

Extra Cost

I/O 26.0% 0.0%

Memory leakage 43.7% 4.0%

Logic leakage 7.3% 2.5%

RAM active 2.7% 0.4%

TCAM active 3.5% 0.0%

Logic active 16.8% 5.5%

Total extra power cost 12.4%

Area

Power

Conclusion

• How do we design a flexible chip? ▫ The RMT switch model ▫ Bring processing close to the memories: pipeline of many stages

▫ Bring the processing to the wires: 224 action CPUs per stage

• How much does it cost? ▫ 15% more than existing switches

28

Download - Revisão: Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN

Top Related