IP Lookup: Application Requirements and concurrency issues
Arvind Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology
September 25, 2019 http://csg.csail.mit.edu/6.375 L09-1
IP Lookup block in a router
[Figure: router block diagram. Labels: Queue Manager, Packet Processor, Exit functions, Control Processor, Line Card (LC), IP Lookup, SRAM (lookup table), Arbitration, Switch, and three further line cards (LC).]
§ A packet is routed based on the “Longest Prefix Match” (LPM) of its IP address with entries in a routing table
§ Line rate and the order of arrival must be maintained; line rate ⇒ 15 Mpps for 10GE
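Sparse tree representation
Routing table (entry: prefix):
F: *
E: 5.*.*.*
D: 10.18.200.5
C: 10.18.200.*
B: 7.14.7.3
A: 7.14.*.*

| IP address | Result | M Ref |
|---|---|---|
| 7.13.7.3 | F | 2 |
| 10.18.201.5 | F | 3 |
| 7.14.7.2 | A | 4 |
| 5.13.7.2 | E | 1 |
| 10.18.200.7 | C | 4 |

[Figure: sparse tree for the routing table above; each node is a table indexed 0 to 255 whose entries hold a result (runs of F, A, or C) or a pointer to a child table.]
In this lecture: Level 1: 16 bits, Level 2: 8 bits, Level 3: 8 bits ⇒ 1 to 3 memory accesses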
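The lookup examples on this slide can be checked with a small software model. The sketch below is not from the lecture: it assumes a tree with four 8-bit levels (which is what the memory reference counts of up to 4 suggest), represents each node as a 256-entry Python list, and uses hypothetical helper names `build_tree` and `lookup`.

```python
DEFAULT = "F"  # result of the catch-all prefix F: *

# Routing table from the slide: result -> fixed leading octets (the rest are wildcards).
PREFIXES = {
    "A": [7, 14],
    "B": [7, 14, 7, 3],
    "C": [10, 18, 200],
    "D": [10, 18, 200, 5],
    "E": [5],
}

def build_tree(prefixes):
    """Build nested 256-entry tables; a str entry is a leaf, a list entry is a pointer."""
    root = [DEFAULT] * 256
    # Insert shorter prefixes first, so that a new child table starts out filled
    # with the best shorter match and longer prefixes overwrite single entries.
    for result, octets in sorted(prefixes.items(), key=lambda kv: len(kv[1])):
        table = root
        for octet in octets[:-1]:
            if not isinstance(table[octet], list):
                table[octet] = [table[octet]] * 256  # expand a leaf into a child table
            table = table[octet]
        table[octets[-1]] = result
    return root

def lookup(root, ip):
    """Return (result, number of table reads) for a dotted-quad address."""
    table = root
    for refs, octet in enumerate((int(x) for x in ip.split(".")), start=1):
        entry = table[octet]
        if not isinstance(entry, list):
            return entry, refs
        table = entry
    raise ValueError("tree is deeper than the address")

ROOT = build_tree(PREFIXES)
for ip in ["7.13.7.3", "10.18.201.5", "7.14.7.2", "5.13.7.2", "10.18.200.7"]:
    print(ip, *lookup(ROOT, ip))
```

Each table read in the model corresponds to one memory reference, so the second value returned by `lookup` plays the role of the M Ref column.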
“C” version of LPM

    int lpm (IPA ipa) /* 3 memory lookups */
    {
        int p;
        /* Level 1: 16 bits */
        p = RAM[ipa[31:16]];
        if (isLeaf(p)) return value(p);
        /* Level 2: 8 bits */
        p = RAM[ptr(p) + ipa[15:8]];
        if (isLeaf(p)) return value(p);
        /* Level 3: 8 bits */
        p = RAM[ptr(p) + ipa[7:0]];
        return value(p); /* must be a leaf */
    }

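The three-level lookup can be exercised in software to see how many memory reads each address needs. The sketch below is an illustration under assumptions, not the lecture's code: RAM entries are encoded as `("leaf", value)` or `("ptr", base)` tuples, the RAM is hand-built for the example routing table (A to F), and the names `build_ram`, `leaf`, `ptr`, and `ip` are hypothetical.

```python
LEVEL1_SIZE = 1 << 16  # level 1 is indexed by the top 16 address bits

def leaf(value):
    return ("leaf", value)

def ptr(base):
    return ("ptr", base)

def build_ram():
    """Hand-built RAM image for the example routing table."""
    ram = [leaf("F")] * LEVEL1_SIZE          # default route F: *

    def alloc(fill):
        base = len(ram)                      # append a 256-entry table
        ram.extend([leaf(fill)] * 256)
        return base

    for low in range(256):                   # E: 5.*.*.*
        ram[(5 << 8) | low] = leaf("E")
    l2 = alloc("A")                          # A: 7.14.*.*
    ram[(7 << 8) | 14] = ptr(l2)
    l3 = alloc("A")
    ram[l2 + 7] = ptr(l3)
    ram[l3 + 3] = leaf("B")                  # B: 7.14.7.3
    l2 = alloc("F")
    ram[(10 << 8) | 18] = ptr(l2)
    l3 = alloc("C")                          # C: 10.18.200.*
    ram[l2 + 200] = ptr(l3)
    ram[l3 + 5] = leaf("D")                  # D: 10.18.200.5
    return ram

def lpm(ram, ipa):
    """Return (result, number of RAM reads) for a 32-bit address."""
    p = ram[(ipa >> 16) & 0xFFFF]            # level 1: 16 bits
    if p[0] == "leaf":
        return p[1], 1
    p = ram[p[1] + ((ipa >> 8) & 0xFF)]      # level 2: 8 bits
    if p[0] == "leaf":
        return p[1], 2
    p = ram[p[1] + (ipa & 0xFF)]             # level 3: 8 bits
    assert p[0] == "leaf", "level 3 must hold a leaf"
    return p[1], 3

def ip(text):
    a, b, c, d = (int(x) for x in text.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

RAM = build_ram()
for addr in ["5.13.7.2", "7.14.9.1", "7.14.7.3", "10.18.200.7"]:
    print(addr, *lpm(RAM, ip(addr)))
```

Each read depends on the result of the previous one, which is why the reads cannot simply be issued in parallel for a single packet.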
Not obvious from the C code how to deal with
- memory latency
- pipelining
[Figure: three lookup tables, indexed 0 to 2^16 - 1, 0 to 2^8 - 1, and 0 to 2^8 - 1]
§ Must process a packet every 1/15 µs or 67 ns
§ Must sustain 3 memory dependent lookups in 67 ns
Memory latency ~30 ns to 40 ns
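The timing budget can be worked through numerically. The sketch below assumes the 15 Mpps figure corresponds to minimum-size Ethernet frames (64 bytes plus 8 bytes of preamble and a 12-byte inter-frame gap); the slides state only the rounded rate, so that derivation is an assumption, and `lookups_in_flight` is a hypothetical helper.

```python
import math

LINK_BPS = 10e9
WIRE_BYTES = 64 + 8 + 12        # assumed: minimum frame + preamble + inter-frame gap

worst_case_pps = LINK_BPS / (WIRE_BYTES * 8)   # roughly 14.88 Mpps, i.e. about 15 Mpps
packet_time_ns = 1e9 / 15e6                    # roughly 66.7 ns per packet at 15 Mpps

def lookups_in_flight(latency_ns, lookups=3):
    """Serial time for dependent lookups, and how many packets must overlap."""
    serial_ns = lookups * latency_ns
    return serial_ns, math.ceil(serial_ns / packet_time_ns)

print(f"worst case: {worst_case_pps / 1e6:.2f} Mpps, {packet_time_ns:.1f} ns per packet")
for latency in (30, 40):
    serial, overlap = lookups_in_flight(latency)
    print(f"latency {latency} ns: 3 dependent lookups take {serial} ns, "
          f"so at least {overlap} packets must be in progress at once")
```

With 30 ns to 40 ns per access, three dependent lookups take 90 ns to 120 ns, which exceeds the 67 ns budget for one packet; line rate can be met only by overlapping the lookups of different packets.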
Longest Prefix Match for IP lookup: 3 possible implementation architectures
Rigid pipeline
Inefficient memory usage but simple design
Linear pipeline
Efficient memory usage through memory port replicator
Circular pipeline
Efficient memory with most complex control
Designer’s Ranking: 1 2 3
Which is “best”?
Arvind, Nikhil, Rosenband & Dave [ICCAD 2004]
IP-Lookup module: Circular pipeline
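[Figure: circular pipeline. enter takes a token from cbuf (getToken) and sends a request to RAM while the fifo holds the token; at “done?”, finished lookups go to cbuf (put) and the “no” path loops back to RAM; getResult drains cbuf.]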
§ Completion buffer ensures that departures take place in order even if lookups complete out-of-order
§ Since cbuf has finite capacity it gives out tokens to control the entry into the circular pipeline
§ The fifo must also hold the “token” while the memory access is in progress: Tuple2#(Token,Bit#(16)), where the Bit#(16) field is the remaining IP
Request-Response Interface for Synchronous Memory

    interface Mem#(type addrT, type dataT);
        method Action req(addrT x);
        method Action deq;
        method dataT peek;
    endinterface

[Figure: a synchronous memory (latency N) with Addr, Enable, Data, Ack, Ready, and DataReady signals, wrapped with a counter ctr (ctr++, ctr--, ready when ctr > 0) behind the req (enq), deq, and peek methods]
Use a BSV wrapper to make a synchronous component latency-insensitive
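A behavioral model can make the req/deq/peek protocol concrete. The sketch below is one possible reading of the wrapper, not the lecture's implementation: it assumes the counter tracks free slots in a response buffer (a credit scheme), and the class name, `tick`, `can_req`, and `can_deq` are hypothetical.

```python
from collections import deque

class LatencyInsensitiveMem:
    """Fixed-latency memory behind a req/deq/peek interface (behavioral sketch)."""

    def __init__(self, data, latency, depth):
        self.data = data
        self.latency = latency
        self.credits = depth        # assumed: free slots in the response buffer
        self.in_flight = deque()    # (cycle when ready, value), in request order
        self.responses = deque()
        self.cycle = 0

    def can_req(self):
        return self.credits > 0

    def req(self, addr):
        assert self.can_req(), "req is not ready"
        self.credits -= 1           # reserve a response slot up front
        self.in_flight.append((self.cycle + self.latency, self.data[addr]))

    def can_deq(self):
        return bool(self.responses)

    def peek(self):
        return self.responses[0]

    def deq(self):
        assert self.can_deq(), "deq is not ready"
        self.responses.popleft()
        self.credits += 1           # the slot is free again

    def tick(self):
        """Advance one clock cycle and deliver any reads that have completed."""
        self.cycle += 1
        while self.in_flight and self.in_flight[0][0] <= self.cycle:
            self.responses.append(self.in_flight.popleft()[1])

mem = LatencyInsensitiveMem(data=[10, 11, 12, 13], latency=3, depth=2)
mem.req(1)
mem.req(2)
print(mem.can_req())   # no credit left until a response is dequeued
for _ in range(3):
    mem.tick()
print(mem.peek())
```

Because a slot is reserved at request time, a response always has somewhere to go, and the client never needs to know the value of N.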
Completion buffer

    interface CBuffer#(type t);
        method ActionValue#(Token) getToken;
        method Action put(Token tok, t d);
        method ActionValue#(t) getResult;
    endinterface

[Figure: cbuf with its getToken, put, and getResult methods]
§ Completion buffer is used to restore the order in which the processing of inputs was started
§ Tokens are given out in order, e.g., (1,2,3,…,16,1,2,…)
§ Data with a token can be put in any order in cbuf
§ Results are returned in the same order in which tokens were issued
IP-Lookup module: Interface methods

    module mkIPLookup(IPLookup);
        instantiate cbuf, RAM and fifo
        rule recirculate… ;
        method Action enter (IP ip);
            Token tok <- cbuf.getToken;
            ram.req(ip[31:16]);
            fifo.enq(tuple2(tok,ip[15:0]));
        endmethod
        method ActionValue#(Msg) getResult();
            let result <- cbuf.getResult;
            return result;
        endmethod
    endmodule

When can enter fire? cbuf, ram & fifo, each has space (is rdy)
Circular Pipeline Rules:

    rule recirculate;
        match{.tok,.rip} = fifo.first;
        fifo.deq; ram.deq;
        if (isLeaf(ram.peek))
            cbuf.put(tok, ram.peek);
        else begin
            fifo.enq(tuple2(tok,(rip << 8)));
            ram.req(ram.peek + rip[15:8]);
        end
    endrule

done? is the same as isLeaf
When can recirculate fire? ram & fifo each has an element, and either ram and fifo, or cbuf, has space
Requires simultaneous enq and deq in the same rule! Is this possible?
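The behavior of the circular pipeline can be explored with a cycle-by-cycle software model. The sketch below is a simplification under stated assumptions, not the BSV design: RAM entries are `("leaf", value)` or `("ptr", base)` tuples, the latency and token count are arbitrary, tokens are ever-increasing integers instead of wrapping indices, and at most one of enter and the recirculate step fires per cycle. The names `simulate`, `leaf`, `ptr`, and `ip` are hypothetical.

```python
from collections import deque

def leaf(value):
    return ("leaf", value)

def ptr(base):
    return ("ptr", base)

def ip(text):
    a, b, c, d = (int(x) for x in text.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# Small RAM image: default F, A: 7.14.*.*, B: 7.14.7.3
ram = [leaf("F")] * (1 << 16)
l2 = len(ram)
ram += [leaf("A")] * 256
ram[(7 << 8) | 14] = ptr(l2)
l3 = len(ram)
ram += [leaf("A")] * 256
ram[l2 + 7] = ptr(l3)
ram[l3 + 3] = leaf("B")

def simulate(ram, ips, latency=3, tokens=4):
    """Return (results in departure order, results in completion order)."""
    fifo = deque()      # (token, remaining 16 address bits)
    mem = deque()       # (cycle when ready, data), in request order
    cbuf = {}           # token -> result, filled in completion order
    next_tok = next_out = cycle = 0
    pending = deque(ips)
    departed, completed = [], []
    while len(departed) < len(ips):
        cycle += 1
        if next_out in cbuf:                      # getResult: oldest token first
            departed.append(cbuf.pop(next_out))
            next_out += 1
        if mem and mem[0][0] <= cycle:            # recirculate / done?
            _, p = mem.popleft()
            tok, rip = fifo.popleft()
            if p[0] == "leaf":
                cbuf[tok] = p[1]                  # cbuf.put
                completed.append(p[1])
            else:
                mem.append((cycle + latency, ram[p[1] + (rip >> 8)]))
                fifo.append((tok, (rip << 8) & 0xFFFF))
        elif pending and next_tok - next_out < tokens:   # enter needs a token
            addr = pending.popleft()
            mem.append((cycle + latency, ram[addr >> 16]))
            fifo.append((next_tok, addr & 0xFFFF))
            next_tok += 1
    return departed, completed

departed, completed = simulate(ram, [ip("7.14.7.3"), ip("5.1.1.1"), ip("7.14.9.9")])
print("completion order:", completed)
print("departure order: ", departed)
```

The first address needs three lookups and the second only one, so lookups complete out of order, while the completion buffer releases results in arrival order.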
Performance
Can a new request enter the system when an old one is leaving? No
Is this worth worrying about?

    rule recirculate;
        match{.tok,.rip} = fifo.first;
        fifo.deq; ram.deq;
        if (isLeaf(ram.peek))
            cbuf.put(tok, ram.peek);
        else begin
            fifo.enq(tuple2(tok,(rip << 8)));
            ram.req(ram.peek + rip[15:8]);
        end
    endrule

    method Action enter (IP ip);
        Token tok <- cbuf.getToken;
        ram.req(ip[31:16]);
        fifo.enq(tuple2(tok,ip[15:0]));
    endmethod

recirculate and enter conflict: dead cycle
The Effect of Dead Cycles
What is the performance loss if “exit” and “enter” can’t ever happen in the same cycle?
>33% slowdown! Unacceptable
Circular Pipeline
§ RAM takes several cycles to respond to a request
§ Each IP request generates 1-3 RAM requests
§ FIFO entries hold base pointer for next lookup and unprocessed part of the IP address
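One way to arrive at the “>33%” figure is to count the cycles each packet occupies. The sketch below rests on an assumption that is not spelled out on the slide: a packet with k lookups needs k firings of the recirculate/exit step plus one firing of enter, and without dead cycles the enter of one packet overlaps with the exit of another. The function name `slowdown` is hypothetical.

```python
from fractions import Fraction

def slowdown(lookups):
    """Extra cycles per packet when enter can never overlap with an exit."""
    with_dead_cycle = lookups + 1   # one enter + `lookups` recirculate/exit firings
    without = lookups               # enter overlaps with another packet's exit
    return Fraction(with_dead_cycle, without) - 1

for k in (3, 2, 1):
    print(f"{k} lookups per packet: {float(slowdown(k)):.0%} slowdown")
```

Under this counting, the slowdown is one third when every packet needs three lookups and grows as packets need fewer lookups.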
Dead Cycles

    rule recirculate;
        match{.tok,.rip} = fifo.first;
        fifo.deq; ram.deq;
        if (isLeaf(ram.peek))
            cbuf.put(tok, ram.peek);
        else begin
            fifo.enq(tuple2(tok,(rip << 8)));
            ram.req(ram.peek + rip[15:8]);
        end
    endrule

    method Action enter (IP ip);
        Token tok <- cbuf.getToken;
        ram.req(ip[31:16]);
        fifo.enq(tuple2(tok,ip[15:0]));
    endmethod

In general enter and recirculate conflict but when isLeaf(p) is true there is no apparent conflict!
Rule Splitting

    rule foo;
        if (p) r1 <= 5;
        else r2 <= 7;
    endrule

≡

    rule fooT if (p);
        r1 <= 5;
    endrule

    rule fooF if (!p);
        r2 <= 7;
    endrule

    rule baz;
        r1 <= 9;
    endrule

§ rules foo and baz conflict
§ rules fooF and baz do not and can be scheduled together
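The effect of splitting can be checked with a small model. The sketch below is a deliberately simplified view, not how the Bluespec compiler analyzes rules: it treats two rules as conflicting whenever their bodies may write the same register, and it models rule bodies as functions on a register dictionary. All names are hypothetical.

```python
def foo(p, regs):
    """Original rule: a single body with an if/else."""
    return {**regs, "r1": 5} if p else {**regs, "r2": 7}

def foo_t(p, regs):
    """Split rule fooT: fires only when the guard p holds."""
    return {**regs, "r1": 5} if p else regs

def foo_f(p, regs):
    """Split rule fooF: fires only when the guard !p holds."""
    return {**regs, "r2": 7} if not p else regs

# Registers each rule body may write, regardless of the value of p.
WRITES = {
    "foo": {"r1", "r2"},
    "fooT": {"r1"},
    "fooF": {"r2"},
    "baz": {"r1"},
}

def conflict(a, b):
    """Simplified test: two rules conflict if they may write the same register."""
    return bool(WRITES[a] & WRITES[b])

start = {"r1": 0, "r2": 0}
for p in (True, False):
    # The two split rules together behave like the original rule.
    assert foo(p, start) == foo_f(p, foo_t(p, start))

print("foo vs baz: ", conflict("foo", "baz"))
print("fooT vs baz:", conflict("fooT", "baz"))
print("fooF vs baz:", conflict("fooF", "baz"))
```

Splitting does not change what foo computes; it only exposes that the `!p` case never touches r1, so that case can be scheduled together with baz.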
Splitting the recirculate rule

    rule recirculate(!isLeaf(ram.peek));
        match{.tok,.rip} = fifo.first;
        fifo.enq(tuple2(tok,(rip << 8)));
        ram.req(ram.peek + rip[15:8]);
        fifo.deq; ram.deq;
    endrule

Rule recirculate is valid only if enq and deq can be executed concurrently

    rule exit (isLeaf(ram.peek));
        match{.tok,.rip} = fifo.first;
        cbuf.put(tok, ram.peek);
        fifo.deq; ram.deq;
    endrule

    method Action enter (IP ip);
        Token tok <- cbuf.getToken;
        ram.req(ip[31:16]);
        fifo.enq(tuple2(tok,ip[15:0]));
    endmethod

Rule exit and method enter can execute concurrently, if cbuf.put and cbuf.getToken can execute concurrently
Concurrent FIFO methods: pipelined FIFO

    rule foo;
        f.enq (5) ; f.deq;
    endrule

≡ (make implicit conditions explicit)

    rule foo (f.notFull && f.notEmpty);
        f.enq (5) ; f.deq;
    endrule

Can foo be enabled?
§ f.notFull can be calculated only after knowing if f.deq fires or not, i.e. there is a combinational path from enable of f.deq to f.notFull
§ Firing condition for rule foo has to be independent of the body
Concurrent FIFO methods: CF FIFO

    rule foo;
        f.enq (5) ; f.deq;
    endrule

≡ (make implicit conditions explicit)

    rule foo (f.notFull && f.notEmpty);
        f.enq (5) ; f.deq;
    endrule

Can foo be enabled?
§ The firing condition for rule foo is independent of the body
§ The FIFO in the IP lookup must therefore be CF
Two-Element FIFO

With registers:

    module mkCFFifo (Fifo#(2, Bit#(n)));
        Reg#(Bit#(n)) da <- mkRegU();
        Reg#(Bool) va <- mkReg(False);
        Reg#(Bit#(n)) db <- mkRegU();
        Reg#(Bool) vb <- mkReg(False);
        rule canonicalize if (vb && !va);
            da <= db; va <= True; vb <= False;
        endrule
        method Action enq(Bit#(n) x) if (!vb);
            begin db <= x; vb <= True; end
        endmethod
        method Action deq if (va);
            va <= False;
        endmethod
        method Bit#(n) first if (va);
            return da;
        endmethod
    endmodule

With EHRs:

    module mkCFFifo (Fifo#(2, Bit#(n)));
        Ehr#(2, Bit#(n)) da <- mkEhr(?);
        Ehr#(2, Bool) va <- mkEhr(False);
        Ehr#(2, Bit#(n)) db <- mkEhr(?);
        Ehr#(2, Bool) vb <- mkEhr(False);
        rule canonicalize (vb[1] && !va[1]);
            da[1] <= db[1]; va[1] <= True;
            vb[1] <= False;
        endrule
        method Action enq(Bit#(n) x) if (!vb[0]);
            db[0] <= x; vb[0] <= True;
        endmethod
        method Action deq if (va[0]);
            va[0] <= False;
        endmethod
        method Bit#(n) first if (va[0]);
            return da[0];
        endmethod
    endmodule

|  | enq | deq | first | cano |
|---|---|---|---|---|
| enq | C | CF | CF | < |
| deq | CF | C | > | < |
| first | CF | < | CF | < |
| cano | > | > | > | C |

[Figure: two-element FIFO with data registers db and da and valid bits vb and va]
In any given cycle simultaneous enq and deq are permitted provided the FIFO is neither full nor empty
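The cycle-level behavior of the two-element FIFO can be traced in software. The sketch below is a behavioral approximation, not generated from the BSV: it assumes enq and deq evaluate their guards on the state at the start of the cycle, and canonicalize then sees their updates within the same cycle. The class name and the `cycle` method are hypothetical.

```python
class TwoElementFifo:
    """Behavioral sketch of a conflict-free two-element FIFO."""

    def __init__(self):
        self.da = self.db = None
        self.va = self.vb = False

    def cycle(self, enq=None, deq=False):
        """Run one clock cycle; return (enq fired, dequeued value or None)."""
        # Guards of enq and deq use the state at the start of the cycle.
        enq_fires = enq is not None and not self.vb
        deq_fires = deq and self.va
        out = self.da if deq_fires else None      # value returned by first
        if deq_fires:
            self.va = False
        if enq_fires:
            self.db, self.vb = enq, True
        # canonicalize observes the updates made by enq and deq this cycle.
        if self.vb and not self.va:
            self.da, self.va, self.vb = self.db, True, False
        return enq_fires, out

f = TwoElementFifo()
print(f.cycle(enq=1))              # one element stored
print(f.cycle(enq=2, deq=True))    # neither full nor empty: both fire
print(f.cycle(enq=3))              # now full
print(f.cycle(enq=4, deq=True))    # full: deq fires, enq has to wait
```

Because the enq guard does not depend on whether deq fires in the same cycle, a full FIFO refuses the enq even while an element is being removed; that is the price of keeping the firing condition independent of the body.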
Completion buffer: Implementation
[Figure: circular buffer cbData with a valid/invalid (V/I) bit per entry, pointers iidx and ridx, and counter cnt]
§ A circular buffer with two pointers iidx and ridx, and a counter cnt
§ Each data element has a valid bit associated with it

    module mkCompletionBuffer(CompletionBuffer#(t));
        Vector#(32, Reg#(Bool)) cbv <- replicateM(mkReg(False));
        Vector#(32, Reg#(t)) cbData <- replicateM(mkRegU());
        Reg#(Bit#(5)) iidx <- mkReg(0);
        Reg#(Bit#(5)) ridx <- mkReg(0);
        Reg#(Bit#(6)) cnt <- mkReg(0);
        rules and methods...
    endmodule

Completion Buffer (cont.)

    method ActionValue#(Bit#(5)) getToken() if (cnt < 32);
        cbv[iidx] <= False;
        iidx <= (iidx==31) ? 0 : iidx + 1;
        cnt <= cnt + 1;
        return iidx;
    endmethod

    method Action put(Token idx, t data);
        cbData[idx] <= data;
        cbv[idx] <= True;
    endmethod

    method ActionValue#(t) getResult() if ((cnt > 0)&&(cbv[ridx]));
        cbv[ridx] <= False;
        ridx <= (ridx==31) ? 0 : ridx + 1;
        cnt <= cnt - 1;
        return cbData[ridx];
    endmethod

Concurrency properties?
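The ordering behavior of the completion buffer can be tried out in software. The sketch below mirrors the three methods as ordinary sequential calls, so it says nothing about which methods may fire in the same clock cycle; the class name, the `can_*` guard helpers, and the small buffer size used in the example are assumptions.

```python
class CompletionBuffer:
    """Sequential model of the completion buffer (no same-cycle concurrency)."""

    def __init__(self, size=32):
        self.size = size
        self.valid = [False] * size
        self.data = [None] * size
        self.iidx = self.ridx = self.cnt = 0

    def can_get_token(self):
        return self.cnt < self.size

    def get_token(self):
        assert self.can_get_token(), "no free slot"
        tok = self.iidx
        self.valid[tok] = False
        self.iidx = (self.iidx + 1) % self.size   # wrap around like iidx==31 ? 0 : iidx+1
        self.cnt += 1
        return tok

    def put(self, tok, value):
        self.data[tok] = value
        self.valid[tok] = True

    def can_get_result(self):
        return self.cnt > 0 and self.valid[self.ridx]

    def get_result(self):
        assert self.can_get_result(), "oldest result not available yet"
        value = self.data[self.ridx]
        self.valid[self.ridx] = False
        self.ridx = (self.ridx + 1) % self.size
        self.cnt -= 1
        return value

cb = CompletionBuffer(size=4)
t0, t1, t2 = cb.get_token(), cb.get_token(), cb.get_token()
cb.put(t2, "third")            # results arrive out of order
cb.put(t0, "first")
print(cb.get_result())         # the oldest token is released first
print(cb.can_get_result())     # t1 has not been filled in yet
cb.put(t1, "second")
print(cb.get_result(), cb.get_result())
```

In hardware, getToken and getResult both update cnt, and getToken, put, and getResult all write valid bits, which is what makes the concurrency question on this slide interesting.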
Completion buffer: Concurrency requirements

    interface CBuffer#(type t);
        method ActionValue#(Token) getToken;
        method Action put(Token tok, t d);
        method ActionValue#(t) getResult;
    endinterface

§ For no dead cycles getToken, put and getResult must be able to execute concurrently
§ If we make these methods CF then everything will work concurrently, i.e. (enter CF exit), (enter CF getResult) and (exit CF getResult)
§ However CF methods are hard to design. Suppose (getToken < put), (getToken < getResult) and (put < getResult) then (enter < exit), (enter < getResult) and (exit < getResult)
§ In fact, any ordering will work
Longest Prefix Match for IP lookup: 3 possible implementation architectures
Rigid pipeline
Inefficient memory usage but simple design
Linear pipeline
Efficient memory usage through memory port replicator
Circular pipeline
Efficient memory with most complex control
Which is “best”?
Implementations of Static pipelines: Two designers, two results

| LPM versions | Best Area (gates) | Best Speed (ns) |
|---|---|---|
| Static V (Replicated FSMs) | 8898 | 3.60 |
| Static V (Single FSM) | 2271 | 3.56 |

[Figure: Replicated: several FSMs and a counter share the RAM through a mux/de-mux, with IP addr in and result out; each packet is processed by one FSM. Shared FSM (BEST): a single FSM with the RAM and a mux, with IP addr in and result out.]
Synthesis results

| LPM versions | Code size (lines) | Best Area (gates) | Best Speed (ns) | Mem. util. (random workload) |
|---|---|---|---|---|
| Static V | 220 | 2271 | 3.56 | 63.5% |
| Static BSV | 179 | 2391 (5% larger) | 3.32 (7% faster) | 63.5% |
| Linear V | 410 | 14759 | 4.7 | 99.9% |
| Linear BSV | 168 | 15910 (8% larger) | 4.7 (same) | 99.9% |
| Circular V | 364 | 8103 | 3.62 | 99.9% |
| Circular BSV | 257 | 8170 (1% larger) | 3.67 (2% slower) | 99.9% |

V = Verilog
Synthesis: TSMC 0.18 µm lib
§ Bluespec results can match carefully coded Verilog
§ Micro-architecture has a dramatic impact on performance
§ Architecture differences are much more important than language differences in determining QoR