
Programming Line-Rate Switches

Anirudh Sivaraman

Joint work with
• MIT: Suvinay Subramanian, Hari Balakrishnan, Mohammad Alizadeh
• Barefoot Networks: Changhoon Kim, Anurag Agrawal, Steve Licking, Mihai Budiu
• Cisco Systems: Shang-Tse Chuang, Sharad Chole, Tom Edsall
• Microsoft Research: George Varghese
• Stanford University: Sachin Katti, Nick McKeown
• University of Washington: Alvin Cheung

Traditional networking

Fixed (simple) switches and programmable (smart) end points

This is showing signs of age …
• Switch features tied to ASIC design cycles (2-3 years)
• Long lag time for new protocol formats (IPv6, VXLAN)
• Operators (esp. in datacenters) need more control over switches
  • Access control, load balancing, bandwidth sharing, measurement
• Many switch algorithms never make it to production

The quest for programmable switches
• Early switches built out of minicomputers, which were sufficient
  • IMPs (1969): Honeywell DDP-516
  • Fuzzball (1971): DEC LSI-11
  • Stanford multiprotocol switch (1981): DEC PDP-11
  • Proteon / MIT C gateway (1980s): DEC MicroVAX II

The quest for programmable switches

[Figure: single-chip aggregate switching capacity (Gbit/s, log scale from 0.01 to 10000) by year, 1999-2014. Hardware switches: Catalyst, Broadcom 5670, Scorpion, Trident, Trident II, Tomahawk. Software switches: Click (CPU), SNAP (Active Packets), IXP2400 (NPU), RouteBricks (multi-core), PacketShader (GPU), NetFPGA-SUME (FPGA).]

Software switches (CPUs, NPUs, GPUs, FPGAs) are 10-100x slower

The vision: programmability at line rate
• Performance of the fastest fixed-function switches (> 1 Tbit/s)
• More programmable than fixed-function switches
  • Much more than OpenFlow/SDN, which only programs the routing/control plane
  • …, but less than software switches
• Such programmable chips are emerging: Tofino, FlexPipe, Xpliant
  • As are languages such as P4 to program them

This Talk

• The machine model: Formalizes the computational capabilities of line-rate switches

• Packet transactions: High-level programming for the switch pipeline

• Push-In First-Out Queues: Programming the scheduler

[Diagram: switch architecture. A parser feeds an ingress pipeline of match/action stages, then the queues/scheduler, then an egress pipeline of match/action stages and the deparser. The parse graph covers Eth, VLAN, IPv4, IPv6, TCP, and new headers.]

A machine model for line-rate switches

[Diagram: a pipeline of match/action stages, Stage 1 through Stage 16. Each stage holds local state and action units; the packet header flows through the stages.]

Typical requirement: 1 pkt / nanosecond

• Atom: smallest unit of atomic packet/state update

[Diagram: an example atom. State x and a constant feed an adder and a multiplier; a 2-to-1 mux ("choice") selects which result is written back to x.]

A switch's atoms constitute its instruction set
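To make the atom idea concrete, here is a small behavioral sketch in Python of the example atom above. It is my own illustration, not the paper's hardware description; the names op and const are illustrative.

    # Behavioral sketch (illustration only, not hardware) of the example atom:
    # state x is combined with a configurable constant by an adder and a
    # multiplier, and a mux bit chooses which result is written back, all
    # atomically in one pipeline stage.
    class ExampleAtom:
        def __init__(self, op, const):
            self.op = op        # "add" or "mul": the mux's "choice" bit
            self.const = const  # the configurable constant operand
            self.x = 0          # the state register inside the atom

        def fire(self):
            """One atomic read-modify-write of x per packet."""
            add_result = self.x + self.const
            mul_result = self.x * self.const
            self.x = add_result if self.op == "add" else mul_result
            return self.x

    # x = x + 1 maps onto this atom (op="add", const=1);
    # x = x * x does not, because the second operand must be a constant.
    counter = ExampleAtom(op="add", const=1)
    for _ in range(3):
        counter.fire()
    assert counter.x == 3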

Stateless vs. stateful operations

Stateless operation: pkt.f4 = pkt.f1 + pkt.f2 - pkt.f3

    Stage 1: pkt.tmp = pkt.f1 + pkt.f2
    Stage 2: pkt.f4 = pkt.tmp - pkt.f3

Can pipeline stateless operations

Stateless vs. stateful operations

Stateful operation: x = x + 1

Naive split across stages:

    Stage 1: pkt.tmp = x
    Stage 2: pkt.tmp++
    Stage 3: x = pkt.tmp

With two back-to-back packets, both read x = 0 before either write lands, so both write back 1: X = 1, but X should be 2, not 1!

Cannot pipeline; need an atomic operation in h/w
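A toy simulation of this hazard (my own illustration, not from the talk): splitting the read, increment, and write-back across stages loses updates when packets arrive back to back, while the atomic version does not.

    # Toy model of the hazard above: with back-to-back packets in a deep
    # pipeline, every packet's read of x happens before the earlier packet's
    # write-back has landed.
    def pipelined_increment(num_packets):
        x = 0
        reads = [x for _ in range(num_packets)]   # stage 1: pkt.tmp = x (stale reads)
        writes = [tmp + 1 for tmp in reads]       # stage 2: pkt.tmp++
        for w in writes:                          # stage 3: x = pkt.tmp
            x = w
        return x

    def atomic_increment(num_packets):
        x = 0
        for _ in range(num_packets):
            x = x + 1                             # read-modify-write in one stage
        return x

    assert pipelined_increment(2) == 1            # lost update: should be 2
    assert atomic_increment(2) == 2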

Stateful atoms can be fairly involved

[Diagram: a larger stateful atom built from adders, subtractors, relational operators (RELOP) against constants, and 2-to-1/3-to-1 muxes over packet fields (pkt_1, pkt_2) and constants. It can update state x in one of four ways based on four predicates, and each predicate can itself depend on the state.]

This Talk

• The machine model: Formalizes the computational capabilities of line-rate switches

• Packet transactions: High-level programming for the switch pipeline

• Push-In First-Out Queues: Programming the scheduler

[Diagram: switch architecture, as before.]

Packet transactions
• Packet transaction: block of imperative code
• Transaction runs to completion, one packet at a time, serially

    if (count == 9):
        pkt.sample = pkt.src
        count = 0
    else:
        pkt.sample = 0
        count++

[Animation: p1.sample = 0, p2.sample = 0, ...; count counts 0 through 9; p10.sample = 1.2.3.4.]

Packet transactions are expressive

    Algorithm                LOC
    Bloom filter             29
    Heavy hitter detection   35
    Rate-Control Protocol    23
    Flowlet switching        37
    Sampled NetFlow          18
    HULL                     26
    Adaptive Virtual Queue   36
    CONGA                    32
    CoDel                    57

Compiling packet transactions

Packet Sampling Algorithm (compiler input):

    if (count == 9):
        pkt.sample = pkt.src
        count = 0
    else:
        pkt.sample = 0
        count++

Packet Sampling Pipeline (compiler output):

    Stage 1: pkt.old = count;
             pkt.tmp = pkt.old == 9;
             pkt.new = pkt.tmp ? 0 : (pkt.old + 1);
             count = pkt.new;
    Stage 2: pkt.sample = pkt.tmp ? pkt.src : 0

Reject code that can't be mapped.

(1) Serial code to codelet pipeline

Create one node for each instruction:

    pkt.old = count
    pkt.tmp = pkt.old == 9
    pkt.new = pkt.tmp ? 0 : (pkt.old + 1)
    count = pkt.new
    pkt.sample = pkt.tmp ? pkt.src : 0

Add packet-field dependencies, then state dependencies; the state dependency from count = pkt.new back to pkt.old = count closes a cycle, because the read and write of count must stay in the same stage.

Condense strongly connected components, giving a DAG of codelets:

    Codelet 1: pkt.old = count; pkt.tmp = pkt.old == 9; pkt.new = pkt.tmp ? 0 : (pkt.old + 1); count = pkt.new
    Codelet 2: pkt.sample = pkt.tmp ? pkt.src : 0

Code pipelining: place Codelet 1 in Stage 1 and Codelet 2 in Stage 2.
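A minimal sketch of this step in Python (my own, not the Domino compiler): build the dependency graph over the five instructions, condense strongly connected components into codelets with Tarjan's algorithm, and topologically order the codelets into stages. The edge list is transcribed from the example above.

    from graphlib import TopologicalSorter

    I0 = "pkt.old = count"
    I1 = "pkt.tmp = pkt.old == 9"
    I2 = "pkt.new = pkt.tmp ? 0 : (pkt.old + 1)"
    I3 = "count = pkt.new"
    I4 = "pkt.sample = pkt.tmp ? pkt.src : 0"

    # Edge u -> v means "v depends on u". Packet-field dependencies follow
    # def-use chains; the state dependency I3 -> I0 closes the cycle that keeps
    # the read and write of `count` together.
    edges = {I0: [I1], I1: [I2, I4], I2: [I3], I3: [I0], I4: []}

    def sccs(graph):
        """Tarjan's algorithm: return strongly connected components as frozensets."""
        index, low, on_stack, stack, comps, counter = {}, {}, set(), [], [], [0]
        def visit(v):
            index[v] = low[v] = counter[0]; counter[0] += 1
            stack.append(v); on_stack.add(v)
            for w in graph[v]:
                if w not in index:
                    visit(w); low[v] = min(low[v], low[w])
                elif w in on_stack:
                    low[v] = min(low[v], index[w])
            if low[v] == index[v]:
                comp = set()
                while True:
                    w = stack.pop(); on_stack.discard(w); comp.add(w)
                    if w == v:
                        break
                comps.append(frozenset(comp))
        for v in graph:
            if v not in index:
                visit(v)
        return comps

    codelets = sccs(edges)
    codelet_of = {instr: c for c in codelets for instr in c}

    # Condensed DAG: map each codelet to the set of codelets it depends on.
    preds = {c: set() for c in codelets}
    for u, vs in edges.items():
        for v in vs:
            if codelet_of[u] is not codelet_of[v]:
                preds[codelet_of[v]].add(codelet_of[u])

    # Topological order of codelets = pipeline stage assignment
    # (instruction order within a codelet is not reconstructed here).
    for stage, codelet in enumerate(TopologicalSorter(preds).static_order(), 1):
        print(f"Stage {stage}:", " | ".join(codelet))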

(2) Codelets to atoms

Assign each codelet to one atom in some stage. Reject if you run out of atoms.

[Diagram: the example atom from before: state x and a constant feed an adder and a multiplier, with a 2-to-1 mux choosing the result written back.]
• x = x + 1 maps to this atom
• x = x * x doesn't map; reject the code

This mapping determines whether the algorithm can run at line rate. Use program synthesis for the mapping problem.
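As a toy illustration of the mapping question (not the actual synthesis procedure, which is sketched in the backup slides), one can brute-force the atom's configuration over a small bit width and check whether any configuration reproduces the codelet on every input:

    # Toy mapping check over 4-bit values: does any (operation, constant)
    # configuration of the example atom compute the codelet for all inputs?
    def atom_output(x, op, const, bits=4):
        mask = (1 << bits) - 1
        return ((x + const) if op == "add" else (x * const)) & mask

    def maps_to_atom(codelet, bits=4):
        """codelet: a function from old x to new x."""
        mask = (1 << bits) - 1
        for op in ("add", "mul"):
            for const in range(1 << bits):
                if all(atom_output(x, op, const, bits) == (codelet(x) & mask)
                       for x in range(1 << bits)):
                    return (op, const)
        return None            # no configuration works: reject the code

    assert maps_to_atom(lambda x: x + 1) == ("add", 1)   # x = x + 1 maps
    assert maps_to_atom(lambda x: x * x) is None         # x = x * x doesn't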

The compiler as a tool for switch design

[Diagram: the Domino compiler takes an algorithm, an atom, and a pipeline geometry. If the algorithm compiles, move on to another algorithm; if it doesn't, modify the pipeline geometry or the atom.]

Demo

Stateful atoms for programmable switches (ordered from least to most expressive)

    Atom        Description
    R/W         Read or write state
    RAW         Read, add, and write back
    PRAW        Predicated version of RAW
    IfElseRAW   2 RAWs, one each when a predicate is true or false
    Sub         IfElseRAW with a stateful subtraction capability
    Nested      4-way predication (nests 2 IfElseRAWs)
    Pairs       Update a pair of state variables
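Rough behavioral models of two of these atoms, based only on the one-line descriptions above (the real atoms are circuits; these Python classes are illustrations):

    class PRAW:
        """Predicated read-add-write: add to state only when a predicate holds."""
        def __init__(self):
            self.s = 0
        def fire(self, predicate, addend):
            if predicate:
                self.s += addend
            return self.s

    class IfElseRAW:
        """Two RAWs: one add when the predicate is true, another when it is false."""
        def __init__(self):
            self.s = 0
        def fire(self, predicate, add_if_true, add_if_false):
            self.s += add_if_true if predicate else add_if_false
            return self.s

    # Hypothetical use: count only packets matching some condition with a PRAW.
    counter = PRAW()
    for match in [True, False, True]:
        counter.fire(predicate=match, addend=1)
    assert counter.s == 2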

Compiling packet transactions to atoms

    Algorithm                Most expressive stateful atom required   Pipeline depth   Pipeline width
    Bloom filter             R/W                                      4                3
    Heavy hitter detection   RAW                                      10               9
    Rate-Control Protocol    PRAW                                     6                2
    Flowlet switching        PRAW                                     3                3
    Sampled NetFlow          IfElseRAW                                4                2
    HULL                     Sub                                      7                1
    Adaptive Virtual Queue   Nested                                   7                3
    CONGA                    Pairs                                    4                2
    CoDel                    Doesn't map                              15               3

~100 atom instances are sufficient

Programmability adds modest cost
• All atoms meet timing at 1 GHz in a 32-nm library.
• They occupy modest additional area relative to a switching chip.

    Atom        Description                                          Atom area (µm^2)   Area for 100 atoms relative to 200 mm^2 chip
    R/W         Read or write state                                  250                0.0125%
    RAW         Read, add, and write back                            431                0.022%
    PRAW        Predicated version of RAW                            791                0.039%
    IfElseRAW   2 RAWs, one each when a predicate is true or false   985                0.049%
    Sub         IfElseRAW with a stateful subtraction capability     1522               0.076%
    Nested      4-way predication (nests 2 IfElseRAWs)               3597               0.179%
    Pairs       Update a pair of state variables                     5997               0.30%

< 1% additional area for 100 atom instances

This Talk

• The machine model: Formalizes the computational capabilities of line-rate switches

• Packet transactions: High-level programming for the switch pipeline

• Push-In First-Out Queues: Programming the scheduler

[Diagram: switch architecture, as before.]

Why is programmable scheduling hard?
• Many algorithms, yet no consensus on abstractions, cf.
  • Parse graphs for parsing
  • Match-action tables for forwarding
  • Packet transactions for data-plane algorithms
• Scheduler has tight timing requirements
  • Can't simply use an FPGA/CPU

Need an expressive abstraction that can run at line rate

What does the scheduler do? It decides
• In what order packets are sent (e.g., FCFS, priorities, weighted fair queueing)
• At what time packets are sent (e.g., token bucket shaping)

A strawman programmable scheduler

[Diagram: packets pass through classification into queues, with programmable logic on the dequeue side deciding order or time.]

• Very little time on the dequeue side => limited programmability
• Can we move programmability to the enqueue side instead?

The Push-In First-Out Queue

Key observation
• In many schedulers, the relative order of buffered packets does not change
• i.e., a packet's place in the scheduling order is known at enqueue

The Push-In First-Out Queue (PIFO): packets are pushed into an arbitrary location based on a rank, and dequeued from the head

[Diagram: a PIFO holding packets ordered by rank; an arriving packet with rank 8 is pushed into its place in the sorted order.]
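A minimal software model of a PIFO (my own sketch; the hardware realization appears later in the talk): push anywhere according to rank, pop only from the head.

    import bisect
    from itertools import count

    class PIFO:
        """Push-In First-Out Queue: lower rank dequeues first; FIFO among ties."""
        def __init__(self):
            self._q = []            # kept sorted by (rank, arrival sequence)
            self._seq = count()

        def push(self, pkt, rank):
            bisect.insort(self._q, (rank, next(self._seq), pkt))

        def pop(self):
            rank, _, pkt = self._q.pop(0)   # always dequeue the head
            return pkt

    pifo = PIFO()
    for pkt, rank in [("a", 9), ("b", 2), ("c", 5)]:
        pifo.push(pkt, rank)
    assert [pifo.pop() for _ in range(3)] == ["b", "c", "a"]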

A programmable scheduler

To program the scheduler, program the rank computation

[Diagram: a rank computation (programmable) feeds a PIFO scheduler (fixed logic). Example rank computation: f = flow(pkt); ...; p.rank = T[f] + p.len.]

A programmable scheduler

Rank computation is a packet transaction

[Diagram: the rank computation runs in the ingress pipeline; the PIFO scheduler sits in the queues/scheduler block.]

Fair queuing

Rank computation:

    1. f = flow(p)
    2. p.start = max(T[f].finish, virtual_time)
    3. T[f].finish = p.start + p.len
    4. p.rank = p.start
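A sketch of this rank computation driving the PIFO model from the earlier sketch (mine; in particular, the slide does not say how virtual_time is maintained, so advancing it to the start tag of the departing packet is an assumption):

    from collections import defaultdict

    finish = defaultdict(int)     # T[f].finish
    virtual_time = 0
    pifo = PIFO()                 # the PIFO model sketched earlier

    def enqueue(pkt, flow, length):
        start = max(finish[flow], virtual_time)   # p.start
        finish[flow] = start + length             # T[f].finish
        pifo.push((pkt, start), rank=start)       # p.rank = p.start

    def dequeue():
        global virtual_time
        pkt, start = pifo.pop()
        virtual_time = start      # assumption: advance to the departing start tag
        return pkt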

Token bucket shaping

Rank computation:

    1. tokens = min(tokens + rate * (now - last), burst)
    2. p.send = now + max((p.len - tokens) / rate, 0)
    3. tokens = tokens - p.len
    4. last = now
    5. p.rank = p.send

Shortest remaining flow size

Rank computation:

    1. f = flow(p)
    2. p.rank = f.rem_size

Beyond a single PIFO

Hierarchical scheduling algorithms need a hierarchy of PIFOs

[Diagram: Hierarchical Packet Fair Queuing (HPFQ). The root splits bandwidth between Red (0.5) and Blue (0.5); within Red, flows a (0.99) and b (0.01); within Blue, flows x (0.5) and y (0.5).]

Tree of PIFOs:
• PIFO-root: WFQ on Red & Blue
• PIFO-Red: WFQ on a & b
• PIFO-Blue: WFQ on x & y
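A rough sketch (mine) of how such a tree of PIFOs schedules packets, reusing the PIFO model from before: the root PIFO orders classes, the leaf PIFOs order each class's packets, and a dequeue walks root to leaf. Ranks are supplied by the caller here; a full HPFQ would compute WFQ ranks at every level.

    class TwoLevelPIFOTree:
        def __init__(self, classes):
            self.root = PIFO()                          # orders class references
            self.leaf = {c: PIFO() for c in classes}    # orders packets per class

        def enqueue(self, pkt, cls, leaf_rank, class_rank):
            self.leaf[cls].push(pkt, leaf_rank)
            self.root.push(cls, class_rank)             # one root entry per packet

        def dequeue(self):
            cls = self.root.pop()        # pick the next class to serve
            return self.leaf[cls].pop()  # then that class's next packet

    sched = TwoLevelPIFOTree(["Red", "Blue"])
    sched.enqueue("a1", "Red", leaf_rank=1, class_rank=1)
    sched.enqueue("x1", "Blue", leaf_rank=1, class_rank=2)
    assert sched.dequeue() == "a1"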

Expressiveness of PIFOs
• Fine-grained priorities: shortest-flow first, earliest deadline first, service-curve EDF
• Hierarchical scheduling: HPFQ, Class-Based Queuing
• Non-work-conserving algorithms: token buckets, Stop-And-Go, Rate-Controlled Service Disciplines
• Least Slack Time First
• Service Curve Earliest Deadline First
• Minimum and maximum rate limits on a flow
• Cannot express some scheduling algorithms, e.g., output shaping

PIFO in hardware
• Performance targets for a shared-memory switch
  • 1 GHz pipeline (64 ports * 10 Gbit/s)
  • 1K flows / physical queues
  • 60K packets (12 MB packet buffer, 200-byte cell)
  • Scheduler is shared across ports
• Naive solution: a flat, sorted array is infeasible
• Exploit the observation that ranks increase within a flow

A single PIFO block

[Diagram: a PIFO block split into a flow scheduler (flip-flops), holding one entry per flow, and a rank store (SRAM), holding each flow's remaining ranks in FIFO order; enqueue and dequeue paths are shown with example flows A, B, C, D.]
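A behavioral sketch (mine, built on the description above and the observation that ranks increase within a flow) of the split design, reusing the earlier PIFO model for the flow scheduler: only one entry per flow is kept sorted, and the rest of each flow's packets sit in a plain FIFO.

    from collections import deque

    class PIFOBlock:
        def __init__(self):
            self.flow_sched = PIFO()   # one head entry per flow, ordered by rank
            self.rank_store = {}       # flow -> FIFO of (rank, pkt) behind the head

        def enqueue(self, pkt, flow, rank):
            if flow not in self.rank_store:
                # flow was empty: its packet becomes the head entry
                self.rank_store[flow] = deque()
                self.flow_sched.push((flow, pkt), rank)
            else:
                # ranks are non-decreasing within a flow, so a FIFO suffices
                self.rank_store[flow].append((rank, pkt))

        def dequeue(self):
            flow, pkt = self.flow_sched.pop()            # overall minimum head rank
            if self.rank_store[flow]:
                rank, nxt = self.rank_store[flow].popleft()
                self.flow_sched.push((flow, nxt), rank)  # promote the next head
            else:
                del self.rank_store[flow]
            return pkt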

Hardware feasibility
• The rank store is a bank of FIFOs, used commonly to buffer data
• A flow scheduler for 1K flows meets timing at 1 GHz in a 16-nm transistor library
  • It continues to meet timing up to 2048 flows, and fails timing at 4096
• 7 mm^2 area for a 5-level programmable hierarchical scheduler
  • < 4% relative to a 200 mm^2 baseline chip

A blueprint for programmable switches
• High-performance networking needs specialized hardware
• Tension between specialization and programmability
• Tailor abstractions to restricted classes of switch functions
  • Stateful header processing without loops: packet transactions, atoms
  • Scheduling: PIFOs
  • Network diagnostics/measurement: performance queries (HotNets 2016)
• Software and papers:
  • http://web.mit.edu/domino (Packet transactions)
  • http://web.mit.edu/pifo (PIFOs)

Backup slides

FAQ
• How is this different from P4?
  • When we started this work a year ago, P4 was much closer to the hardware. Over time, it's gotten more high-level, thanks in some part to this work (sequential semantics, ternary operators).
  • Even now, however, P4 doesn't provide transactional or atomic semantics.
  • We do have a P4 backend.
• Why a pipeline?
  • NPUs have a shared-memory architecture, but sharing memory is hard and slows down the switch.

FAQ
• What's in the compiler?
  • Strongly connected components to extract atomic portions.
  • Code generation using program synthesis.
• Do the atoms generalize?
  • We don't know for sure. We designed the atoms and were able to tweak them a little bit to serve more algorithms. But this is something we don't yet have a handle on.
• Is someone implementing it?
  • We are tabling a proposal on @atomic for P4.
  • There's industry interest in PIFOs, but no one I know is actively working on it.

The SKETCH algorithm
• We have an automated search procedure that configures the atoms appropriately to match the specification, using a SAT solver to verify equivalence.
• This procedure uses 2 SAT solvers:
  1. Generate a random input x.
  2. Does there exist a configuration such that spec and impl. agree on the random input?
  3. Can we use the same configuration for all x?
  4. If not, add the x to the set of counterexamples and go back to step 1.
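A toy version of this counterexample-guided loop (mine), specialized to the add/mul example atom from earlier in the talk; the real procedure phrases both queries as SAT problems over bit vectors, whereas here both "solvers" are brute force over 8-bit values.

    import random

    BITS = 8
    MASK = (1 << BITS) - 1

    def atom(x, cfg):
        op, const = cfg
        return ((x + const) if op == "add" else (x * const)) & MASK

    def find_config(spec, examples):
        """'Synthesis query': a configuration agreeing with spec on all examples."""
        for op in ("add", "mul"):
            for const in range(1 << BITS):
                if all(atom(x, (op, const)) == spec(x) & MASK for x in examples):
                    return (op, const)
        return None

    def find_counterexample(spec, cfg):
        """'Verification query': an input where spec and atom disagree, if any."""
        for x in range(1 << BITS):
            if atom(x, cfg) != spec(x) & MASK:
                return x
        return None

    def cegis(spec):
        examples = [random.randrange(1 << BITS)]       # 1. random input
        while True:
            cfg = find_config(spec, examples)          # 2. agree on inputs so far?
            if cfg is None:
                return None                            #    codelet doesn't map
            cex = find_counterexample(spec, cfg)       # 3. same config for all x?
            if cex is None:
                return cfg
            examples.append(cex)                       # 4. add counterexample, repeat

    assert cegis(lambda x: x + 1) == ("add", 1)
    assert cegis(lambda x: x * x) is None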

Relationship to prior compiler techniques

    Technique                           Prior work                              Differences
    If conversion                       Kennedy et al. 1983                     No breaks, continue, gotos, loops
    Static Single-Assignment            Ferrante et al. 1988                    No branches
    Strongly Connected Components       Lam et al. 1989 (software pipelining)   Scheduling in space instead of time
    Synthesis for instruction mapping   Technology mapping                      Map to 1 hardware primitive, not multiple
                                        Superoptimization                       Counter-example-guided, not brute force

Hardware feasibility of PIFOs
• The number of flows handled by a PIFO affects timing.
• The number of logical PIFOs within a PIFO, the priority and metadata width, and the number of PIFO blocks only increase area.

Static Single-Assignment

Before:

    pkt.id = hash2(pkt.sport, pkt.dport) % NUM_FLOWLETS;
    pkt.last_time = last_time[pkt.id];
    ...
    pkt.last_time = pkt.arrival;
    last_time[pkt.id] = pkt.last_time;

After:

    pkt.id0 = hash2(pkt.sport, pkt.dport) % NUM_FLOWLETS;
    pkt.last_time0 = last_time[pkt.id0];
    ...
    pkt.last_time1 = pkt.arrival;
    ...
    last_time[pkt.id0] = pkt.last_time1;

Expression Flattening

Before:

    pkt.tmp = pkt.arrival - last_time[pkt.id] > THRESHOLD;
    saved_hop[pkt.id] = pkt.tmp ? pkt.new_hop : saved_hop[pkt.id];

After:

    pkt.tmp = pkt.arrival - last_time[pkt.id];
    pkt.tmp2 = pkt.tmp > THRESHOLD;
    saved_hop[pkt.id] = pkt.tmp2 ? pkt.new_hop : saved_hop[pkt.id];

Instruction mapping: results
• Generic method to handle fairly complex templates
• Templates determine if a Domino program can run at line rate
• Example results:
  • Flowlet switching needs conditional execution to save next-hop information:

        saved_hop[pkt.id] = pkt.tmp2 ? pkt.new_hop : saved_hop[pkt.id]

  • A simple increment suffices for heavy-hitter detection:

        count_min_sketch[hash] = count_min_sketch[hash] + 1

Generating P4 code
• Required changes to P4
  • Sequential execution semantics (required to read from, modify, and write back state)
  • Expression support
  • Both available in v1.1
• Encapsulate every codelet in a table's default action
• Chain the tables together as a P4 control program

Branch Removal

Before:

    if (pkt.arrival - last_time[pkt.id] > THRESHOLD) {
      saved_hop[pkt.id] = pkt.new_hop;
    }

After:

    pkt.tmp = pkt.arrival - last_time[pkt.id] > THRESHOLD;
    saved_hop[pkt.id] = pkt.tmp ? pkt.new_hop : saved_hop[pkt.id];

Handling State Variables

Before:

    pkt.id = hash2(pkt.sport, pkt.dport) % NUM_FLOWLETS;
    ...
    last_time[pkt.id] = pkt.arrival;
    ...

After:

    pkt.id = hash2(pkt.sport, pkt.dport) % NUM_FLOWLETS;
    pkt.last_time = last_time[pkt.id]; // Read flank
    ...
    pkt.last_time = pkt.arrival;
    ...
    last_time[pkt.id] = pkt.last_time; // Write flank

Instruction mapping: the SKETCH algorithm
• Map each codelet to an atom template
• Convert both the codelet and the template to functions of bit vectors
• Q: Does there exist a template configuration s.t. for all inputs, the codelet and template functions agree?
• This is a quantified boolean satisfiability (QBF) problem
• Use the SKETCH program synthesis tool to automate it

FAQ
• Does predication require you to do twice the amount of work (for both the if and the else branch)?
  • Yes, but it's done in parallel, so it doesn't affect timing.
  • The additional area overhead is negligible.
• What do you do when code doesn't map?
  • We reject it and the programmer retries.
• Why can't you give better diagnostics?
  • It's hard to say why a SAT solver says unsatisfiable, which is at the heart of these issues.
• Approximating square root?
  • Approximation is a good next step, especially for algorithms that are OK with sampling.
• How do you handle wrap-arounds in the PIFO?
  • We don't right now.
• Is the compiler optimal?
  • No, it's only correct.

The Domino compiler

[Diagram: Domino source code flows through canonicalization of the sequential code (branch removal, handling state variables), a sequential-to-parallel transformation (code pipelining), and a stage that respects hardware constraints (instruction mapping), producing the processing pipeline.]

Code Pipelining (flowlet switching example)

Statements after canonicalization:

    pkt.id = hash2(pkt.sport, pkt.dport) % NUM_FLOWLETS
    pkt.new_hop = hash3(pkt.sport, pkt.dport, pkt.arrival) % NUM_HOPS
    pkt.last_time = last_time[pkt.id]
    last_time[pkt.id] = pkt.arrival
    pkt.tmp = pkt.arrival - pkt.last_time
    pkt.tmp2 = pkt.tmp > THRESHOLD
    pkt.saved_hop = saved_hop[pkt.id]
    saved_hop[pkt.id] = pkt.tmp2 ? pkt.new_hop : pkt.saved_hop
    pkt.next_hop = pkt.tmp2 ? pkt.new_hop : pkt.saved_hop

Pair up the read/write flanks of each state variable (last_time, saved_hop), add packet-field dependencies, and condense strongly connected components into codelets.

Programming with Packet Transactions

    #define NUM_FLOWLETS 8000
    #define THRESHOLD 5
    #define NUM_HOPS 10

    struct Packet {
      int sport;
      int dport;
      ...
    };

    int last_time[NUM_FLOWLETS] = {0};
    int saved_hop[NUM_FLOWLETS] = {0};

    void flowlet(struct Packet pkt) {
      pkt.new_hop = hash3(pkt.sport, pkt.dport, pkt.arrival) % NUM_HOPS;
      pkt.id = hash2(pkt.sport, pkt.dport) % NUM_FLOWLETS;
      if (pkt.arrival - last_time[pkt.id] > THRESHOLD) {
        saved_hop[pkt.id] = pkt.new_hop;
      }
      last_time[pkt.id] = pkt.arrival;
      pkt.next_hop = saved_hop[pkt.id];
    }

Domino program → pipeline:

    Stage 1: pkt.new_hop = hash3(pkt.sport, pkt.dport, pkt.arrival) % NUM_HOPS;
             pkt.id = hash2(pkt.sport, pkt.dport) % NUM_FLOWLETS;
    Stage 2: pkt.last_time = last_time[pkt.id];
             last_time[pkt.id] = pkt.arrival;
    Stage 3: pkt.tmp = pkt.arrival - pkt.last_time;
    Stage 4: pkt.tmp2 = pkt.tmp > 5;
    Stage 5: pkt.saved_hop = saved_hop[pkt.id];
             saved_hop[pkt.id] = pkt.tmp2 ? pkt.new_hop : pkt.saved_hop;
    Stage 6: pkt.next_hop = pkt.tmp2 ? pkt.new_hop : pkt.saved_hop;

The quest for programmability

    Switch              Year   Line rate
    Cisco Catalyst      1999   32 Gbit/s
    Broadcom 5670       2004   80 Gbit/s
    Broadcom Scorpion   2007   240 Gbit/s
    Broadcom Trident    2010   640 Gbit/s
    Broadcom Tomahawk   2014   3.2 Tbit/s

    System           Year   Substrate    Performance
    Click            2000   CPUs         170 Mbit/s
    Intel IXP 2400   2002   NPUs         4 Gbit/s
    RouteBricks      2009   Multi-core   35 Gbit/s
    PacketShader     2010   GPUs         40 Gbit/s
    NetFPGA-SUME     2014   FPGA         100 Gbit/s

Programmability => 10--100x slower than line rate.

The quest for programmability

Programmability => 10--100x slower than line rate.

[Figure: performance scaling of switches, Gbit/s (log scale) vs. year, 1999-2013. Line-rate routers: Catalyst, Broadcom 5670, Scorpion, Trident, Tomahawk. Software routers: CPU, NPU, multi-core, GPU, FPGA.]

Compiler targets: diagram

[Diagram: a stateless atom (a single operation from +, -, >, <, AND, OR over two operands, each a packet field or a constant, writing pkt.f3) and a stateful atom (state x updated through an adder fed by 2-to-1 muxes that choose between a packet field and a constant, and between x and 0).]

Why are switches pipelined?

Performance requirements at line rate
• Aggregate capacity ~ 1 Tbit/s
• Packet size ~ 1000 bits
• 10 operations per packet: routing, access control (ACL), tunnels, ...

Need to process 1 billion pkts/s, 10 ops per packet
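Spelling out the arithmetic (the 10 billion ops/s total is the implied figure behind the 10 GHz processor on the next slide):

    capacity_bits_per_s = 1e12        # ~1 Tbit/s aggregate capacity
    packet_size_bits = 1000           # ~1000-bit packets
    ops_per_packet = 10               # routing, ACL, tunnels, ...

    packets_per_s = capacity_bits_per_s / packet_size_bits
    assert packets_per_s == 1e9                        # 1 billion pkts/s
    assert packets_per_s * ops_per_packet == 1e10      # 10 billion ops/s total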

Single processor architecture

[Diagram: packets flow through one 10 GHz processor running all operations (1: route lookup, 2: ACL lookup, 3: tunnel lookup, ..., 10: ...) as a single match/action unit.]

Can't build a 10 GHz processor!

Packet-parallel architecture

[Diagram: packets are spread across four parallel 1 GHz processors, each running all operations (1: route lookup, 2: ACL lookup, 3: tunnel lookup, ..., 10: ...) and each acting as a full match/action unit with its own copy of the lookup tables.]

Memory replication increases die area

Function-parallel or pipelined architecture

[Diagram: packets flow through a pipeline of 1 GHz circuits, one per function: route lookup (with its route lookup table), ACL lookup (ACL lookup table), tunnel lookup (tunnel lookup table); each is a match/action stage.]

• Factors out global state into per-stage local state
• Replaces a full-blown processor with a circuit

P4 comparison

Programming with packet transactions

    Algorithm                LOC   P4 LOC
    Bloom filter             29    104
    Heavy hitter detection   35    192
    Rate-Control Protocol    23    75
    Flowlet switching        37    107
    Sampled NetFlow          18    70
    HULL                     26    95
    Adaptive Virtual Queue   36    147
    CONGA                    32    89
    CoDel                    57    271
