
Boosting FIB Caching Performance with Aggregation

Garegin Grigoryan, Rochester Institute of Technology, [email protected]
Yaoqing Liu, Fairleigh Dickinson University, [email protected]
Minseok Kwon, Rochester Institute of Technology, [email protected]

ABSTRACT
In the era of high-performance cloud computations, networks need to ensure fast packet forwarding. This task is carried out by TCAM forwarding chips that perform line-rate Longest Prefix Matches in a Forwarding Information Base (FIB). However, with the increasing number of prefixes in IPv4 and IPv6 routing tables, the price of TCAM increases as well. In this work, we present a novel FIB compression technique that adds an aggregation layer to a FIB caching architecture. In Combined FIB Caching and Aggregation (CFCA), the cache-hit ratio is maximized up to 99.94% with only 2.50% of the FIB entries, while the churn in TCAM is reduced by more than 40% compared to low-churn FIB aggregation techniques.

CCS CONCEPTS
• Networks → Network architectures;

KEYWORDS
Routing, FIB caching, FIB aggregation, TCAM overflow

ACM Reference Format:
Garegin Grigoryan, Yaoqing Liu, and Minseok Kwon. 2020. Boosting FIB Caching Performance with Aggregation. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '20), June 23–26, 2020, Stockholm, Sweden. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3369583.3392682

1 INTRODUCTION
A network router looks up the Forwarding Information Base (FIB, or Forwarding Table) to select the next-hop of a packet via Longest Prefix Matching (LPM). With the mass adoption of distributed computing and the resulting increase in network traffic, performing LPM at line rate requires expensive Ternary Content-Addressable Memory (TCAM), which provides parallel lookup through hundreds of thousands of FIB entries in a single CPU cycle. Meanwhile, the size of a global FIB at backbone routers reached 820K IPv4 and 80K IPv6 routes in 2020 [1]. The super-linear growth is mostly attributed to IPv4 prefix fragmentation for traffic engineering and multi-homing purposes [26, 37]. While the IPv4 routing table is projected to exceed 1M entries in 2024, the size of an IPv6 table will at least double within the next five years [12]. FIB growth poses challenges for network operators. First, a larger FIB means more costly TCAM memory with higher energy consumption [16, 18, 21, 27, 35]. Next, migrating from old to new hardware leads to higher operational cost [37]. Based on our market analysis, the price of a Cisco ASR 9000 Series Line Card that can support up to 1M TCAM entries can make up as much as 80% of the total cost of a router. Alternatively, increasing the space for IPv4 prefixes in TCAM at the cost of reducing the allocation for IPv6 prefixes [28] is problematic given the rate of IPv6 growth.

FIB aggregation has been proposed to tackle these challenges by merging adjacent or overlapping prefixes with the same next-hop. FIB aggregation, however, increases BGP churn, in which a single BGP update causes multiple entry insertions, deletions, and updates. This is particularly detrimental to TCAM, as one insertion may require up to 1,000 operations [16]. Another technique to deal with the high cost of TCAM is FIB caching, which exploits the high skewness between the number of popular and unpopular routes, as studied in [11, 20, 30]. For example, the Programmable FIB Caching Architecture (PFCA) [14] achieves a 99.8% cache-hit ratio with only 3% of the total routes in TCAM, where the primary FIB table (or FIB cache) resides. FIB caching techniques convert original prefixes into fragmented non-overlapping prefixes to avoid cache hiding, where a less specific prefix hides a more specific one in a secondary FIB table, causing errors in LPM matching [11, 14, 20, 23]. The many adjacent prefixes generated by this conversion, however, occupy additional space, increase cache-miss ratios, and ultimately increase lookup latencies.

We integrate FIB aggregation and caching to obtain a small FIB size, efficient use of TCAM, and a higher cache-hit ratio, without compromising LPM correctness. To this end, we design Combined FIB Caching and Aggregation (CFCA, in short) by introducing an aggregation layer into the Route Manager of the control plane of PFCA.


Table 1: Example of FIB aggregation

(a) Original FIB
Label  Prefix             Next-hop
A      129.10.124.0/24    1
B      129.10.124.0/27    1
C      129.10.124.64/26   1
D      129.10.124.192/26  2

(b) Aggregated FIB
Label  Prefix             Next-hop
A      129.10.124.0/24    1
D      129.10.124.192/26  2

Unlike other FIB aggregation algorithms [10, 22], CFCA does not produce overlapping routes, and thus prevents cache hiding, while handling BGP updates incrementally without re-aggregation of the entire FIB. Specifically, CFCA uses binary prefix trees and prevents prefixes from the same branch from being installed into the cache and the main FIB. CFCA's aggregation frees up significant space, dramatically increases cache-hit ratios, and stabilizes the cache compared to other FIB compression mechanisms.

Our contributions are as follows:

(1) We design a FIB aggregation algorithm compatible with FIB caching. CFCA's aggregation preserves the forwarding behavior of a FIB without introducing overlapping routes. As a result, CFCA outperforms existing FIB caching techniques by achieving a 99.94% cache-hit ratio with only 2.50% of the FIB entries in a TCAM cache.

(2) We design CFCA's incremental BGP update handling algorithm. In CFCA, only 0.625% of BGP updates trigger changes in TCAM; the total churn in TCAM is reduced by 40% compared to standard FIB aggregation. Meanwhile, the time cost of BGP update handling in CFCA exceeds PFCA's by only 0.09 µs (11.6%).

(3) We evaluate CFCA using traffic traces with more than 3.5 billion packets and 45,600 BGP updates. Our simulation demonstrates the advantages of CFCA over standard FIB caching and aggregation approaches in terms of cache-miss ratio and cache churn.

2 BACKGROUND
FIB aggregation. An example of FIB aggregation is given in Table 1. The aggregability of a FIB depends on the number of adjacent or overlapping prefixes with the same next-hop. ORTC-based algorithms [10, 24] reduce the size of a FIB by up to 80%. On average, however, optimal aggregation almost doubles the number of changes to a FIB needed to preserve its compressed state. FAQS [22] significantly reduces the number of FIB changes caused by BGP updates applied against a compressed FIB. Nevertheless, in the worst case, a single BGP update in FAQS produces almost 6,500 changes in a FIB. Such a large number of changes in Ternary Content-Addressable Memory (TCAM) chips is detrimental, since writing to TCAM is an expensive and complex procedure; for example, a single entry insertion can require up to 1,000 operations in a TCAM [16].

FIB caching. Numerous studies [11, 20, 23, 30] show that most of the network traffic is destined to a small set of destination Autonomous Systems. This traffic includes real-time applications, social media networks, and video streaming, and requires fast forwarding at backbone routers. With FIB caching, TCAM memory acts as a fast cache for the FIB and contains only the prefixes carrying large amounts of traffic. The challenges for FIB caching include popular prefix selection, victim cache entry selection, and the latency of cache-missed LPM lookups. Finally, a FIB caching solution must prevent cache hiding, where a less specific prefix hides a more specific one in a secondary FIB table, causing errors in LPM matching. PFCA [14] leverages the Protocol Independent Switch Architecture (PISA) [7] to build a FIB caching solution with minimized latency and line-rate cache table management. PFCA handles cache hiding through prefix extension, i.e., converting a FIB into a set of non-overlapping prefixes and preserving that state through BGP updates. Our analysis of the resulting FIB in the cache showed that it is compressible by at least 15%. There are two sources of redundant prefixes that can be compressed:

(1) BGP-announced adjacent prefixes that share the same next-hop in the original FIB.

(2) Prefixes generated by the prefix extension procedure, which increases the number of entries by 40% [15].

Freeing cache memory may decrease the cache churn and increase the cache-hit ratio. Yet optimal aggregation algorithms [10] cannot be directly applied to a cache or the main FIB, as they may introduce overlapping prefixes and trigger cache hiding. For example, suppose the original FIB consists of the entries shown in Table 1(a). Applying the above-mentioned FIB aggregation algorithms halves this FIB (see Table 1(b)) but creates a risk of cache hiding. Specifically, if the longer prefix 129.10.124.192/26 is not in the cache, matching the IP address 129.10.124.192 in the cache results in the less specific prefix 129.10.124.0/24, which violates the Longest Prefix Matching rule for packet forwarding.
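To make cache hiding concrete, here is a minimal C sketch (our own illustration, not the paper's code) that runs a linear LPM over the aggregated FIB of Table 1(b), once with and once without the /26 entry in the cache:

```c
#include <stdio.h>
#include <stdint.h>

/* Minimal longest-prefix match over a tiny table, illustrating the
 * cache-hiding example of Table 1. Hand-encoded values; a sketch only. */
typedef struct { uint32_t net; int len; int next_hop; } route;

static int lpm(const route *tbl, int n, uint32_t addr) {
    int best_len = -1, nh = -1;
    for (int i = 0; i < n; i++) {
        uint32_t mask = tbl[i].len ? 0xFFFFFFFFu << (32 - tbl[i].len) : 0;
        if ((addr & mask) == tbl[i].net && tbl[i].len > best_len) {
            best_len = tbl[i].len;
            nh = tbl[i].next_hop;
        }
    }
    return nh;
}

int main(void) {
    uint32_t dst = (129u << 24) | (10u << 16) | (124u << 8) | 192u; /* 129.10.124.192 */
    route aggregated[] = {
        { (129u << 24) | (10u << 16) | (124u << 8) | 0u,   24, 1 }, /* A */
        { (129u << 24) | (10u << 16) | (124u << 8) | 192u, 26, 2 }, /* D */
    };
    printf("full aggregated FIB -> next-hop %d\n", lpm(aggregated, 2, dst)); /* 2 */
    /* Simulate D missing from the cache by searching only the first entry: */
    printf("cache without D     -> next-hop %d\n", lpm(aggregated, 1, dst)); /* 1 */
    return 0;
}
```

With D absent, the /24 hides it and the lookup returns next-hop 1 instead of 2, which is exactly the violation described above.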

Programmable data plane. The Protocol-Independent Switch Architecture (PISA) [7] is a relevant platform for building and deploying novel routing solutions. Unlike OpenFlow [25], PISA implies programmability of both the control and the data planes.


Figure 1: Portable Switch Architecture

Figure 2: The architecture of CFCA

In this work, we build CFCA with regard to the specification of the Portable Switch Architecture (PSA) [3], one of the PISA targets proposed by the P4.org Architecture Working Group [2]. We show the design of PSA in Figure 1. The white blocks represent the programmable components of PSA, such as the ingress/egress parsers/deparsers, the packet processing pipelines, and the control plane. The gray blocks represent the hardware blocks responsible for packet replication, buffering, and queuing. The programmable blocks are configured via the P4 data plane programming language [33]. P4 allows developers to customize the packet parser and deparser, the match-action tables, and the ingress/egress flows in the data plane. In addition, with the help of P4, a PSA data plane may contain user-defined registers, meters, and counters, and may perform hashing operations.

3 DESIGN OF CFCA
As shown in Figure 2, CFCA is built on top of the Programmable FIB Caching Architecture (PFCA) [14]. The control plane of CFCA is responsible for collecting network routes and managing the tables in the data plane, namely the Level-1 (L1) cache in TCAM, the Level-2 (L2) cache in SRAM, and the rest of the FIB in DRAM. To avoid cache hiding (see Section 2 for details), the Route Manager (RM) extends the FIB routes received from the network into a set of non-overlapping prefixes (see Figure 3).

Figure 3: Prefix extension in CFCA. The leaf nodes represent the set of non-overlapping prefixes.

The main departure of CFCA from PFCA lies in the addition of a FIB aggregation module that compresses the extended prefix table of the RM into a smaller set of non-overlapping prefixes. The prefixes are aggregated via a binary prefix tree, in which prefixes and their next-hops are represented as nodes, and sibling nodes with the same next-hop merge into their parent node (see Figure 4). Importantly, the FIB aggregation module can incrementally handle BGP updates and push those updates into the FIB tables with no risk of cache hiding. The FIB entries are provided with traffic counters to help detect frequently accessed prefixes. A traffic counter is incremented on each match, and when a threshold is reached, the entry migrates to a faster cache table. In addition, the router keeps track of the less popular routes (also known as light traffic hitters) in the cache for future cache replacement. The rest of this section discusses the details of the RM, the FIB aggregation module, the data plane workflow, and the light traffic hitters detection algorithm.

3.1 Control plane workflow
The Route Manager (RM) first processes route updates from neighbor routers and passes these updates to the FIB aggregation module. The RM then performs prefix migration by moving popular prefixes into fast memory and unpopular prefixes into slow memory. The FIB aggregation module serves as a filter between the RM and the data plane: it keeps the routing table in a compressed state by performing incremental FIB aggregation without introducing new overlapping routes or changing the forwarding behavior. In the following sections, we discuss how this is achieved at each stage of computation.

3.1.1 Initial FIB installation. At the initialization stage, CFCA's RM receives a snapshot of the current prefix table, a Routing Information Base (RIB), and extends it into a set of non-overlapping prefixes in order to avoid cache hiding due to rule dependencies (see Section 2). Prefix extension (see Figure 3 for an example) is done by adding the prefixes to the binary prefix tree and generating additional nodes to obtain a full binary prefix tree. Such a tree in CFCA has the following properties:


(1) The root node of the tree corresponds to the prefix 0/0 and is assigned a default next-hop value.
(2) Each node n in the tree has either zero or two children.
(3) The left (right) child of a node n, denoted n.l (n.r), corresponds to a more specific prefix with an appended 0 (1) bit.
(4) The set of non-overlapping prefixes is composed of all the leaf nodes of the tree.

Each node n in CFCA's full binary tree has the following parameters:

(1) The current table of the node, n.t. Possible values are NONE, Level-1 (L1) or Level-2 (L2) cache, or DRAM.
(2) The prefix corresponding to the node, n.p.
(3) A REAL/FAKE flag, n.rf, indicating whether the node originated from the RIB (REAL) or was generated as a result of prefix extension (FAKE).
(4) The original next-hop, n.o, taken from the RIB. FAKE nodes inherit the original next-hops of their REAL parent nodes.
(5) The selected next-hop, n.s, set by CFCA's FIB aggregation algorithm.
(6) The FIB status, n.f: IN_FIB or NON_FIB, indicating whether the corresponding prefix and its selected next-hop should be installed into the data plane.
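As a point of reference, such a node can be sketched in C as follows. All names (cfca_node, table_t, fib_status_t) are our own illustration; the paper does not publish its simulator's data structures:

```c
#include <stdint.h>

/* An illustrative C layout of a CFCA prefix tree node; the names are
 * ours, not the paper's simulator code. */
typedef enum { TBL_NONE, TBL_L1, TBL_L2, TBL_DRAM } table_t;   /* n.t */
typedef enum { NON_FIB, IN_FIB } fib_status_t;                 /* n.f */

typedef struct cfca_node {
    uint32_t prefix;          /* n.p: prefix bits in the high-order positions */
    uint8_t  prefix_len;      /* n.p: prefix length, 0..32 for IPv4 */
    int      is_real;         /* n.rf: 1 = REAL (from RIB), 0 = FAKE */
    uint32_t original_nh;     /* n.o: next-hop from the RIB */
    uint32_t selected_nh;     /* n.s: next-hop set by aggregation (0 = none) */
    fib_status_t fib_status;  /* n.f: IN_FIB or NON_FIB */
    table_t  table;           /* n.t: current data plane location */
    struct cfca_node *left;   /* n.l: child with appended 0 bit */
    struct cfca_node *right;  /* n.r: child with appended 1 bit */
    struct cfca_node *parent; /* used by the bottom-up traversal */
} cfca_node;
```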

Although prefix extension does not change the forwarding behavior of a prefix table, it may significantly increase its size. CFCA's aggregation algorithm mitigates this negative effect by recursively "folding" adjacent prefixes that share the same next-hop into a larger prefix and installing into the FIB only one prefix per branch, thus preventing cache hiding. We show the pseudo-code of the initial FIB aggregation in Algorithm 1. The compression is done within a single post-order traversal over the binary prefix tree, starting from the root node. Consider an example with the entries from Table 1(a). The initial binary prefix tree for these entries contains 5 leaf nodes (see Figure 4(a)). At the left branch, nodes B and G share the same next-hop from the RIB; thus, they merge into F. F selects its next-hop from B and G. Since that next-hop is equal to C's next-hop, F and C similarly merge into E. At the right branch, leaves I and D have different next-hops from the RIB and do not merge; consequently, nodes E and H do not "fold" into A, and the traversal terminates. As a result, instead of 5 prefixes, only 3 non-overlapping prefixes (from the nodes E, I, and D) will be installed into the FIB. More specifically, at each node n, CFCA runs two operations: (1) setting the selected next-hop value of n; (2) setting the FIB status of the left and right children of n, namely n.l and n.r. Below is a detailed description of both operations.

Figure 4: FIB aggregation in CFCA. (a) The prefix tree after prefix extension; (b) the prefix tree after FIB aggregation.

(1) Setting the selected next-hop value n.s for the node n. If n is a leaf node, then n.s is equal to the original next-hop value n.o. If n is an internal node, its selected next-hop depends on the selected next-hops of n.l and n.r:
  a. If n.l.s = n.r.s, then n.s = n.l.s;
  b. Otherwise, n.s = 0.

(2) Setting the FIB status of the left and right child nodes of n (n.l.f and n.r.f). This operation is performed for the internal nodes of the prefix binary tree. If n.s is not equal to 0, then both n.l.f and n.r.f are set to NON_FIB. Otherwise (n.s = 0), the FIB status of a child node is set to IN_FIB if its own selected next-hop value is not equal to 0. Initially, the entries corresponding to IN_FIB nodes are pushed into the DRAM memory of the data plane. As traffic passes through the data plane, those entries may migrate to the L2 and L1 caches if they become popular.

Intuitively, a non-zero selected next-hop value n.s of a node n means that all the leaf descendants of n have an original next-hop value equal to n.s. Hence, the FIB entries corresponding to those leaf prefixes can be compressed into a single entry. We call the nodes with a non-zero selected next-hop points of aggregation. Meanwhile, a zero value of n.s means that not all of the descendant leaf nodes of n have the same original next-hop value. If a left or right child node of n is a point of aggregation, then it is assigned IN_FIB status. At this stage, our algorithm guarantees that the prefix region n.p is fully covered by a set of IN_FIB descendant nodes of n. Thus, n and all of its ancestor nodes should be assigned NON_FIB status.

Algorithm 1: aggrInit
Input: n, where n is a node; n.o its original next-hop; n.s its selected next-hop; n.f its FIB status; n.t the table assigned to the corresponding route; n.l, n.r its left and right children.
  if n is a leaf node then
    n.s ← n.o
  else if n is an internal node then
    aggrInit(n.l)
    aggrInit(n.r)
    if n.l.s = n.r.s then
      /* First case: the children share the same next-hop */
      n.s ← n.l.s
      n.l.f ← n.r.f ← NON_FIB
    else
      /* Second case: the children have different next-hops */
      n.s ← 0
      if n.l.s ≠ 0 then
        /* The left child of n is a point of aggregation */
        n.l.f ← IN_FIB, n.l.t ← DRAM
        Push the prefix and the selected next-hop of n.l into the DRAM memory
      else
        n.l.f ← NON_FIB
      if n.r.s ≠ 0 then
        /* The right child of n is a point of aggregation */
        n.r.f ← IN_FIB, n.r.t ← DRAM
        Push the prefix and the selected next-hop of n.r into the DRAM memory
      else
        n.r.f ← NON_FIB
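Under the illustrative cfca_node layout sketched above, Algorithm 1 translates into C roughly as follows; push_to_dram() is a hypothetical stand-in for the actual data plane installation call:

```c
/* A C sketch of Algorithm 1 (aggrInit), under the cfca_node assumptions
 * above; push_to_dram() is a hypothetical data plane hook. */
void push_to_dram(cfca_node *n);

static void install_or_hide(cfca_node *c) {
    if (c->selected_nh != 0) {          /* c is a point of aggregation */
        c->fib_status = IN_FIB;
        c->table = TBL_DRAM;
        push_to_dram(c);
    } else {
        c->fib_status = NON_FIB;
    }
}

void aggr_init(cfca_node *n) {
    if (n->left == NULL) {              /* leaf: select the original next-hop */
        n->selected_nh = n->original_nh;
        return;
    }
    aggr_init(n->left);                 /* single post-order traversal */
    aggr_init(n->right);
    if (n->left->selected_nh == n->right->selected_nh) {
        /* children share a next-hop: fold them into n */
        n->selected_nh = n->left->selected_nh;
        n->left->fib_status = n->right->fib_status = NON_FIB;
    } else {
        /* children differ: each child with a non-zero selected
         * next-hop becomes a point of aggregation */
        n->selected_nh = 0;
        install_or_hide(n->left);
        install_or_hide(n->right);
    }
}
```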

CFCA's initial aggregation algorithm consists of a single post-order traversal of a binary tree; hence its complexity is O(N), where N is the number of nodes in the tree. The number of nodes N equals 2·Np − 1, where Np is the total number of non-overlapping prefixes (i.e., leaf nodes). In the worst case, each original prefix p in a FIB can produce up to w non-overlapping prefixes, where w is p's prefix length.

Figure 5: BGP update handling workflow in CFCA. (p, p0: prefixes; n, n0: nodes; NH: next-hop value; root: the root node of the tree.)

BGP update handling in CFCA has a similar complexity, as it leverages post-order traversals to keep the FIB in a compressed state. We demonstrate this below.

3.1.2 BGP update handling. The RM receives BGP updates (new entry announcements, next-hop updates, and entry withdrawals) from a BGP daemon running on the control plane. The RM applies these updates to the binary prefix tree by updating or adding a node. Next, CFCA re-aggregates the sub-tree rooted at the updated (or added) node, as well as the ancestors of that node (see Figure 5 for the workflow overview). In most cases, however, an update affects only a few nodes of the tree. Below we describe in detail how CFCA handles each type of BGP update.

• Announcement of a new next-hop NH for an existing prefix p in the FIB. A next-hop update of a prefix p may affect not just p itself but also the prefixes generated by prefix extension. The resulting changes may require next-hop updates in the data plane, as well as de-aggregation or re-aggregation of prefixes overlapping with p. Thus, CFCA needs to perform partial traversals of the tree branch on which p resides. We describe this process in detail next.

After the RM finds the node n corresponding to p in the prefix binary tree, it assigns the new next-hop value to n.o, the original next-hop of n. Next, it partially traverses the branch rooted at n in post-order (see Algorithm 2), avoiding the branches rooted at REAL nodes and visiting FAKE nodes only. Intuitively, the REAL descendants of n are not affected by the next-hop update of n, so traversing them is useless. The FAKE nodes, on the contrary, are generated during prefix extension in CFCA.


Algorithm 2: postOrderUpdate
Input: n, NH, where n is a node; n.o its original next-hop; n.l, n.r its left and right children; n.rf its REAL/FAKE flag; NH the new next-hop value.
  if n.l.rf = FAKE then
    n.l.o ← NH
    postOrderUpdate(n.l, NH)
  if n.r.rf = FAKE then
    n.r.o ← NH
    postOrderUpdate(n.r, NH)
  setSelectedNextHop(n)
  setFIBstatus(n)

Algorithm 3: setSelectedNextHop
Input: n, where n is a node; n.s its selected next-hop.
  if n is a leaf node then
    n.s ← n.o
  else if n is an internal node then
    if n.l.s = n.r.s then
      n.s ← n.l.s
    else
      n.s ← 0
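Continuing the illustrative C sketch, Algorithms 2 and 3 can be rendered as follows; set_fib_status() (Algorithm 4) is only declared here:

```c
/* A C sketch of Algorithms 2 and 3, under the same cfca_node
 * assumptions as above. set_fib_status() is declared, not defined. */
void set_fib_status(cfca_node *n);

void set_selected_next_hop(cfca_node *n) {       /* Algorithm 3 */
    if (n->left == NULL)
        n->selected_nh = n->original_nh;         /* leaf */
    else if (n->left->selected_nh == n->right->selected_nh)
        n->selected_nh = n->left->selected_nh;   /* children agree: fold */
    else
        n->selected_nh = 0;                      /* children disagree */
}

void post_order_update(cfca_node *n, uint32_t nh) {  /* Algorithm 2 */
    /* Propagate the new next-hop only into FAKE children; REAL
     * subtrees carry their own RIB next-hops and are skipped. */
    if (n->left && !n->left->is_real) {
        n->left->original_nh = nh;
        post_order_update(n->left, nh);
    }
    if (n->right && !n->right->is_real) {
        n->right->original_nh = nh;
        post_order_update(n->right, nh);
    }
    set_selected_next_hop(n);
    set_fib_status(n);
}
```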

Since FAKE nodes inherit their next-hop values from n, they are updated with every change to n.o.

During the traversal, at each traversed node, CFCA:

(1) Updates the node's original next-hop with NH.
(2) Sets the selected next-hop of the node in the same manner as in the initial FIB aggregation. More specifically, for leaf nodes the selected next-hop is equal to the original next-hop; for internal nodes the selected next-hop is 0 if the children have different selected next-hop values. The procedure setSelectedNextHop(n) is shown in Algorithm 3.
(3) Sets the FIB status of the node's children (if they exist) and pushes the next-hop changes to the data plane. More specifically, if a node has a non-zero selected next-hop, its children should not be present in the FIB. Otherwise, the children with non-zero selected next-hops (i.e., points of aggregation) should stay in, or be installed into, the data plane. Note that the RM tracks the current location of each entry in n.t and is thus able to push updates to the correct data plane destination (the L1 cache, the L2 cache, or DRAM). We give the pseudo-code of setFIBstatus(n) in Algorithm 4.

Once the post-order traversal is done, the ancestors of the node n are traversed in a bottom-up manner.

Algorithm 4: setFIBstatus
Input: n, where n is a node.
  if n is an internal node then
    if n.s ≠ 0 then
      /* First case: n is a point of aggregation */
      if n.l.f = IN_FIB then
        Remove n.l from the data plane table n.l.t
      n.l.f ← NON_FIB
      if n.r.f = IN_FIB then
        Remove n.r from the data plane table n.r.t
      n.r.f ← NON_FIB
    else if n.s = 0 then
      /* Second case: n's descendant prefixes should be present in the FIB */
      if n.l.f = NON_FIB then
        n.l.f ← IN_FIB, n.l.t ← DRAM
        Push the prefix and the selected next-hop of n.l into the DRAM memory
      else
        Push the selected next-hop update for the prefix of n.l into the data plane table n.l.t
      if n.r.f = NON_FIB then
        n.r.f ← IN_FIB, n.r.t ← DRAM
        Push the prefix and the selected next-hop of n.r into the DRAM memory
      else
        Push the selected next-hop update for the prefix of n.r into the data plane table n.r.t

There are two reasons why this procedure is necessary: first, the ancestors' selected next-hop values depend on their children's selected next-hops; second, in CFCA, the FIB status of a node is defined by its parent. Since changes in the tree propagate through selected next-hops, the bottom-up traversal can terminate as soon as a node's selected next-hop does not change. As in the post-order traversal, at each visited node n, CFCA calls both setSelectedNextHop(n) and setFIBstatus(n). We show the pseudo-code of bottomUpUpdate(n) in Algorithm 5. Note that bottomUpUpdate(n) is called only if the value of n.s changed after calling setSelectedNextHop(n).

Although a new route announcement and a route withdrawal require more operational steps by the RM, they mostly rely on the same routines as a next-hop update, i.e., postOrderUpdate(n, NH) and bottomUpUpdate(n). We show this below.


Algorithm 5: bottomUpUpdate
Input: n, where n is a node.
  repeat
    n ← parent node of n
    oldSelectedNextHop ← n.s
    setSelectedNextHop(n)
    setFIBstatus(n)
  until oldSelectedNextHop = n.s
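A matching C sketch of Algorithm 5, reusing set_selected_next_hop() and set_fib_status() from the sketches above; the null-parent guard is our addition for the case where the climb reaches the root:

```c
/* Illustrative C version of Algorithm 5: climb from n toward the root,
 * recomputing the selected next-hop, and stop once it no longer changes. */
void bottom_up_update(cfca_node *n) {
    while (n->parent != NULL) {
        n = n->parent;
        uint32_t old = n->selected_nh;
        set_selected_next_hop(n);
        set_fib_status(n);
        if (old == n->selected_nh)   /* unchanged: ancestors are unaffected */
            break;
    }
}
```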

Figure 6: BGP update handling by CFCA. (a) Update for C and announcement of a prefix at H; (b) nodes E, F, C, H, I, and D have been affected.

Algorithm 6: fragmentPrefix
Input: p, NH, where p is the newly announced prefix and NH is p's next-hop.
  n0 ← the root node of the tree
  foreach bit b of p do
    if n0 is a leaf then
      break
    n0 ← (b = 0) ? n0.l : n0.r
  l0, l ← prefix lengths of n0.p and p
  foreach level of the tree Δ ∈ (l0, l) do
    Generate sibling FAKE nodes with the original next-hop of n0
  Generate the REAL node n for p, n.o ← NH
  Generate the FAKE sibling node for n, n.o ← n0.o
  return n0

• Announcement of a new route (p, NH), where p is the prefix and NH is its next-hop value. Two cases may occur. In the first case, a node n associated with p already exists in the binary prefix tree. CFCA handles such announcements similarly to next-hop updates: the RM updates n's original next-hop value, changes its type to REAL (if it was FAKE prior to the update), and runs the postOrderUpdate(n, NH) and bottomUpUpdate(n) routines.

In the second case, the node n for p must first be added to the binary prefix tree. To that end, the RM traverses the tree according to the bits of p to find n0, the existing leaf ancestor of n. Next, the RM installs the node n while generating FAKE sibling nodes at each level of the tree between n0 and n. This generation is necessary to avoid cache hiding: the sibling nodes inherit their original next-hop value from n0, and their prefix values cover the prefix range of n0 excluding p. We call this procedure prefix fragmentation, since it fragments the prefix of n0 into a set of prefixes that do not overlap with p. The pseudo-code of prefix fragmentation is given in Algorithm 6.
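A hedged C sketch of this fragmentation step, under the same cfca_node assumptions; new_node() is a hypothetical allocation helper:

```c
/* Illustrative C sketch of Algorithm 6 (prefix fragmentation).
 * new_node() is a hypothetical helper that allocates a child of
 * `parent` with the given prefix, length, and REAL/FAKE flag. */
cfca_node *new_node(cfca_node *parent, uint32_t prefix, uint8_t len, int is_real);

cfca_node *fragment_prefix(cfca_node *root, uint32_t p, uint8_t plen, uint32_t nh) {
    cfca_node *n0 = root;
    while (n0->left != NULL) {                      /* descend along the bits of p */
        uint32_t bit = (p >> (31 - n0->prefix_len)) & 1u;
        n0 = bit ? n0->right : n0->left;
    }
    cfca_node *cur = n0;                            /* n0: covering leaf for p */
    for (uint8_t len = n0->prefix_len; len < plen; len++) {
        uint32_t bit      = (p >> (31 - len)) & 1u;
        uint32_t path_pfx = cur->prefix | (bit << (31 - len));
        uint32_t sib_pfx  = cur->prefix | ((1u - bit) << (31 - len));
        int reached_p     = (len + 1 == plen);      /* last level: the REAL node */
        cfca_node *on_path = new_node(cur, path_pfx, len + 1, reached_p);
        cfca_node *sibling = new_node(cur, sib_pfx, len + 1, 0);
        on_path->original_nh = reached_p ? nh : n0->original_nh;
        sibling->original_nh = n0->original_nh;     /* FAKE siblings inherit n0.o */
        cur->left  = bit ? sibling : on_path;
        cur->right = bit ? on_path : sibling;
        cur = on_path;
    }
    return n0;                                      /* caller re-aggregates from n0 */
}
```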

The newly added nodes need to be aggregated; thus, CFCA calls the initial aggregation routine aggrInit starting from the node n0. Finally, if the selected next-hop of the node n0 has changed, CFCA runs bottomUpUpdate(n0) to re-aggregate the upper part of the branch, up to the first unchanged ancestor.

• Withdrawal of a prefix p. Suppose n is the node that represents p in the binary prefix tree. CFCA handles prefix withdrawals as follows:

(1) The Route Manager sets the type of n to FAKE.
(2) The node inherits its original next-hop value from its parent node. This value might be a default next-hop inherited from the root node of the tree, or an arbitrary next-hop from a less specific prefix in the original RIB.
(3) Similarly to next-hop updates, the aggregation module runs the postOrderUpdate(n, NH) and bottomUpUpdate(n) routines.

Intuitively, the prefix withdrawal operation in CFCA can be represented as an update of the node's original next-hop with its parent's original next-hop value. Meanwhile, CFCA keeps the binary prefix tree compact by detecting and removing sibling FAKE leaf nodes from the data structure.

Figure 6 illustrates BGP update handling by CFCA, continuing with the entries from Table 1(a). In this example, the RM receives two BGP updates: a next-hop update for the prefix 129.10.124.64/26 and an announcement of a new prefix, 129.10.124.128/25. For the first update, CFCA re-aggregates the tree with a bottom-up traversal from the node C, after which the nodes E, F, and C change their FIB status. For the announcement of 129.10.124.128/25, the algorithm first changes the type of the corresponding node H from FAKE to REAL. Next, it performs the post-order and bottom-up traversals from the node H, after which the prefixes at the nodes I and D aggregate into H.

3.1.3 Popular routes installation. The RM "migrates" FIB entries to faster memory once a route in the L2 cache table or in DRAM reaches a certain number of hits, for example, 100 hits per second. The threshold can be pre-configured by the control plane of a switch. If the faster memory is full, the RM frees space by picking a victim cache entry with the help of the Light Traffic Hitters Detection module of the data plane and moving that entry to slower memory.

3.2 Data plane workflow
Similarly to PFCA, the essential part of CFCA's data plane is the Light Traffic Hitters Detection (LTHD) module, designed to collect the least popular routes of the Level-1 (L1) and Level-2 (L2) caches for future cache replacement. As we show later in this paper, LTHD can be built on switches with the Portable Switch Architecture [3]. The flow between the match-action tables of the data plane (the L1 and L2 caches and the DRAM FIB) is managed by the programmable ingress pipeline. When a packet arrives at a CFCA switch, its destination IP address is matched against the L1 cache FIB in TCAM. In the case of a match, the matching prefix's counter is incremented and its value is passed through the LTHD module. The packet is then forwarded to the next-hop.

In the case of an L1 cache miss, the packet's IP header is passed to the L2 cache and the DRAM table in parallel. On an L2 cache match, CFCA increments the counter of the matching prefix and checks whether it has reached a pre-configured threshold; if so, the prefix is installed into the L1 cache. If the L1 cache is full, CFCA randomly picks an unpopular route from the LTHD module's tables and replaces it with the new route from the L2 cache. The unpopular prefix from the L1 cache then migrates back to the L2 cache. If the threshold on the L2 cache for a matched prefix is not reached, CFCA passes the prefix and its counter value through the LTHD module of the L2 cache. In each case, a match in the L2 cache results in the packet being forwarded to the next-hop corresponding to the matched prefix.

Finally, if the match happens in DRAM, then, similarly to the L2 cache, CFCA checks whether the matched prefix's counter has reached a pre-configured threshold and installs that prefix into the L2 cache if needed. As with the L1 cache, if the L2 cache is full, the cache victim entry is selected from the LTHD module and migrated into the FIB in DRAM. The workflow of CFCA's data plane is presented in Figure 7. Next, we describe the Light Traffic Hitters Detection module in more detail.

Figure 7: Data plane workflow in CFCA. (p: prefix; NH(p): next-hop for p; C(p): counter for prefix p; T: threshold.)

3.3 Light Traffic Hitters Detection
The light traffic hitters are the least popular prefixes in the cache. When an entry from slower memory needs to be installed into a full cache, an efficient FIB caching solution must perform line-rate light-hitter selection among tens of thousands of FIB routes. As CFCA is designed on top of PFCA [14], it uses the same mechanism to collect the light traffic hitters in the data plane. The Light Traffic Hitters Detection (LTHD) module finds the least popular prefixes without real-time scanning of the cache. Its design is similar to the heavy-hitter detection proposed by Sivaraman et al. [31]. To collect the least popular prefixes, LTHD uses d hash tables T1, T2, ..., Td, each assigned a specific hash key h1, h2, ..., hd. On a cache hit, the matched prefix p and its counter value c are pipelined through the hash tables. In effect, LTHD identifies the most popular prefix among p and the prefixes p collides with in T1, T2, ..., Td, and drops it from the tables, keeping the less popular prefixes as potential victim entries.

Consider the example in Figure 8 with a 3-stage LTHD and a pipelined prefix p1 = 11011 with counter c1 = 3. At the first stage, p1 is compared against the third entry of table T1, because its hash is h1(p1) = 3. Since the third entry's counter is less than c1, p1 is carried to the second stage. There, h2(p1) = 2, and T2's second entry, with prefix p2 = 01101, has a counter c2 > c1. Thus, p1 replaces p2, and p2 is carried to the last stage. Finally, h3(p2) = 1, and the first entry in table T3 has a counter c3 > c2. Thus, p2 replaces p3 in T3 as the less popular prefix. p3 is removed from the LTHD and will no longer be considered as a potential victim cache entry.
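A minimal C sketch of this multi-stage replacement; LTHD_STAGES, LTHD_SLOTS, and the per-stage hash are illustrative placeholders (a hardware pipeline would use independent hash keys h1, ..., hd):

```c
#include <stdint.h>

#define LTHD_STAGES 3    /* d stages; our evaluation setup uses four */
#define LTHD_SLOTS  10   /* hash table size per stage */

typedef struct {
    uint32_t prefix;     /* cached prefix (0 = empty slot) */
    uint64_t counter;    /* its traffic counter snapshot */
} lthd_entry;

static lthd_entry stage[LTHD_STAGES][LTHD_SLOTS];

/* Placeholder per-stage hash, keyed by the stage index. */
static unsigned lthd_hash(int s, uint32_t prefix) {
    return (prefix * 2654435761u + (unsigned)s * 40503u) % LTHD_SLOTS;
}

/* On a cache hit, pipeline (prefix, counter) through the stages: whenever
 * the stored entry is more popular than the carried one, swap them, so the
 * lighter prefix stays in the table as a victim candidate. The heaviest
 * prefix in the collision chain falls off the last stage. */
void lthd_update(uint32_t prefix, uint64_t counter) {
    for (int s = 0; s < LTHD_STAGES; s++) {
        lthd_entry *e = &stage[s][lthd_hash(s, prefix)];
        if (e->prefix == 0) {           /* empty slot: store and stop */
            e->prefix  = prefix;
            e->counter = counter;
            return;
        }
        if (e->counter > counter) {     /* stored entry is heavier: swap */
            lthd_entry heavier = *e;
            e->prefix  = prefix;
            e->counter = counter;
            prefix  = heavier.prefix;
            counter = heavier.counter;
        }
        /* else: the carried prefix is heavier; keep carrying it */
    }
}
```

Tracing the example above through this sketch reproduces the same outcome: p1 passes stage one untouched, displaces p2 at stage two, and p2 displaces p3 at stage three, so p3 drops out of the victim candidate set.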


Figure 8: Example of Light Traffic Hitters Detection

4 EVALUATION
We evaluate the performance of CFCA by comparing it to pure FIB caching (PFCA) and to FIB aggregation algorithms with incremental BGP update handling (FAQS and FIFA-S):

• PFCA is a FIB caching-only architecture for a programmable switch, described in [14]. We use PFCA for comparison in terms of cache-miss ratio and BGP-related cache churn.
• FAQS [22] is a FIB aggregation algorithm with minimal data plane churn and a suboptimal aggregation ratio.
• FIFA-S [24] is an ORTC-based FIB aggregation algorithm with an optimal aggregation ratio.

To emulate the operation of a programmable switch with CFCA, we designed a data plane prototype in the P4 language. As work on the P4 compiler for PSA is still in progress [32], our prototype targets the Simple Switch architecture [34] and thus has limited capabilities compared to the original design of CFCA. For example, Simple Switch lacks the generation of packet digests in the data plane, i.e., messages that can be forwarded to the control plane, which is necessary for detecting heavy hitters there. While our future work will focus on a full P4 prototype of CFCA, in this paper we evaluate the efficiency of CFCA's combined FIB caching and aggregation using a trace-driven simulator written in C.

4.1 Experimental setup
We use RouteViews [4] to build a routing table that contains 599,298 entries. Our experiment simulates router operations for about an hour by processing a mixed trace of 45,600 BGP updates from RouteViews and an anonymized traffic trace from CAIDA [8] with 3.5 billion packets. For performance evaluation, we measure the following metrics:

(1) Cache-miss ratio per 100K packets for the Level-1 (L1) and Level-2 (L2) caches.
(2) The number of cache installations/evictions in the L1 and L2 caches.
(3) The number of BGP updates applied to the L1 and L2 caches, compared to the total number of BGP updates.

Figure 9: CFCA vs PFCA. Cache-miss ratio per 100K packets for 15,000-entry L1 and 20,000-entry L2 caches.

We tune both CFCA and PFCA in the same way, as follows:

(1) We set the number of stages in the Light Traffic Hitters Detection (LTHD) module to four.
(2) Each stage has a hash table of size 10.
(3) Initially, the L1 and L2 caches are empty; however, they fill up quickly, as the initial thresholds for cache installation from DRAM and from the Level-2 cache are 1 and 15 matches, respectively.
(4) Once the L1 and L2 caches are full, we set the thresholds to 100 matches per minute for DRAM and 300 matches per minute for the L2 cache.
(5) We compare the performance of CFCA and PFCA for different sizes of the L1 and L2 caches, namely 5K with 10K, 10K with 15K, and 15K with 20K.

We verified the correctness of BGP update handling by CFCA, PFCA, FAQS, and FIFA-S with VeriTable [15], a tool designed for fast verification of the forwarding equivalence of multiple forwarding tables.

4.2 CFCA vs FIB Caching (PFCA)
The results comparing CFCA with PFCA are shown in Table 2.

4.2.1 Cache-miss ratio. We first measure cache-miss ratios. Here, the larger the values, the more IP lookups must be performed in slow memory, causing packet forwarding delays and even losses. Ideally, the L1 cache-miss ratio is minimized. In CFCA, the average cache-miss ratio is 1.13% when the L1 cache contains only 0.83% of the FIB entries. Moreover, if the L1 cache holds more prefixes (2.50% of the FIB), the cache-miss ratio drops drastically to 0.058%, which is more than five times lower than PFCA's average cache-miss ratio (0.316%). Figure 9 shows this cache-miss ratio improvement through FIB aggregation (see the red line for the average).


Table 2: CFCA vs FIB caching with PFCA. Evaluation summary

       L1 size   L1      L2      L1        L2        L1             L2             L1 BGP  L1 BGP
       ratio, %  size    size    misses,%  misses,%  installations  installations  churn   burst
CFCA   0.83      5,000   10,000  1.133     0.153     109,070        623,994        88      2
       1.67      10,000  15,000  0.144     0.017     18,108         150,547        156     5
       2.50      15,000  20,000  0.058     0.004     17,277         37,842         285     6
PFCA   0.83      5,000   10,000  4.135     1.258     302,337        664,580        51      8
       1.67      10,000  15,000  0.873     0.211     72,400         118,847        133     12
       2.50      15,000  20,000  0.316     0.071     32,242         58,246         195     12

Table 3: CFCA L1 cache vs FAQS/FIFA-S

        Compression ratio, %  FIB churn  BGP burst
CFCA    2.50                  21,495     6
FAQS    28.04                 36,386     43
FIFA-S  25.08                 48,672     33

Figure 10: L1 cache in CFCA and PFCA. (a) Installations and evictions; (b) number of BGP updates.

The number of L1 cache-miss lookups in CFCA rarely exceeds 0.4%, while the number of cache misses is negligible for L2. In contrast, the cache-miss ratio of PFCA reaches nearly 1% in the worst case.

4.2.2 Cache installations and evictions. In CFCA, the control plane aggregates FIB entries before pushing them into the data plane, and this aggregation dramatically decreases both L1 and L2 cache installation and eviction rates. The low installation and eviction rates help save energy and improve performance, as writing to TCAM is computationally expensive [16]. We show the L1 cache installations for both PFCA and CFCA in Figure 10(a), with the L1 and L2 cache sizes set to 15,000 and 20,000 entries, respectively. A cache entry installation does not require an eviction until the cache is fully initialized (see the dashed lines in the figure). Note that both CFCA and PFCA apply a cold start strategy, i.e., the L1 and L2 cache tables are initially empty and fill up with traffic. Once the cache tables become full, a cache replacement is needed for each new installation, and we use Light Traffic Hitters Detection to identify victims for eviction.

4.2.3 BGP updates impact. One drawback of FIB aggregation is the potential increase in TCAM cache churn due to BGP updates applied to a compressed FIB. Table 2 shows that the BGP-related churn in the L1 cache of CFCA is slightly higher than in PFCA, because PFCA does not aggregate the FIB. However, the percentage of TCAM updates in CFCA remains low: for example, only 0.625% of all BGP updates result in TCAM updates when the L1 cache size is 15,000 (see Figure 10(b)). In addition, we evaluate the BGP burst in L1, i.e., the maximum number of FIB changes caused by a single BGP update. Interestingly, CFCA shows a lower BGP burst in all configurations, as shown in Table 2. FIB aggregation compresses the prefixes generated by prefix extension, so a single next-hop update of a parent prefix affects fewer entries in TCAM than in PFCA, where more extended prefixes are present in TCAM.

4.3 CFCA vs FIB Aggregation (FAQS/FIFA-S)
FIB aggregation minimizes its disadvantages when applied together with FIB caching. Table 3 shows that the total churn, which includes L1 installations, evictions, and next-hop updates, in CFCA with 15,000 entries is far less than in FAQS, a FIB aggregation scheme with minimal BGP-related cache churn. For ORTC-based algorithms with an optimal aggregation ratio, such as FIFA-S, the total churn is even larger. Moreover, CFCA does not produce large BGP bursts, which we attribute to the observation that BGP updates mostly concern unpopular routes. The only strength of FIB aggregation over FIB caching is that all routes stay in fast memory; with FIB caching, on the other hand, lookups that do not find a prefix match in the cache are forwarded to slower memory, obviously increasing forwarding delay. CFCA mitigates this drawback, as it reduces the cache-miss ratio to 0.058% for an L1 cache holding only 2.50% of the FIB entries.

4.4 Evaluating a heavier trace
In the previous subsections, we showed the performance of CFCA under a traffic load with a mean transmission rate of about 1M packets per second.


Figure 11: CFCA. Cache-miss ratio per 100K packets under a heavier load.

Figure 12: BGP update handling time.

To test the behavior of CFCA under a heavier load, we collected a more recent (01/17/2019) one-hour trace from CAIDA [8] with 7.7 billion packets and a one-hour BGP update trace with 1,364,963 routing updates. We also used a larger routing table, with 725,344 entries, and L1/L2 caches with 20,000 and 30,000 entries, respectively.

Due to the significantly higher packet rate, the average cache-miss ratio for this trace is higher (≈0.27%). However, CFCA keeps it within 1% (see Figure 11). For this trace, we also compared the BGP update handling time of CFCA against PFCA, FAQS, and FIFA-S (see Figure 12). CFCA's control plane takes only 0.78 µs per BGP update on average, slightly slower than PFCA (0.69 µs per update) due to the additional FIB aggregation. Meanwhile, the time cost of the pure FIB aggregation algorithms is much higher (2.11 µs and 3.50 µs per update for FAQS and FIFA-S, respectively).

5 RELATED WORK
FIB aggregation. Given an arbitrary FIB table, the Optimal Routing Table Construction algorithm (ORTC) [10] builds an equivalent routing table with a minimal number of entries. Bienkowski et al. [6] present a formal study of the trade-offs between FIB compression optimality and the cost of BGP updates and propose HIMS, an online FIB aggregation algorithm. FAQS [22] minimizes the data plane churn and the BGP update handling complexity by sacrificing FIB compression ratio.

FIB caching. The idea of caching entries in a forwarding table goes back to the early stages of the Internet. In [17], Raj Jain et al. showed the "source locality" of packets and proposed caching FIB entries for active flows and prefetching entries for the corresponding reverse flows. Route caching was also studied for Layer-4 switching [36]. Kim et al. give an overview of caching popular flows of an IP routing table in [20]. Sarrar et al. [30] propose a Traffic-aware Flow Offloading (TFO) strategy that outperforms LRU for caches with fewer than 10,000 entries. In [29], the authors apply rule caching for lossy compression of packet classifiers and show that 3% of the original classifiers are sufficient for handling 99% of the traffic; the paper does not elaborate on a real-time cache victim selection strategy. Katta et al. in Cacheflow [19] build dependency graphs and identify independent groups of rules to avoid cache hiding. Such an approach requires the control plane to perform graph computations each time a flow becomes popular and needs to be installed in the cache. Similarly, Bienkowski et al. [5] propose an online Tree Caching algorithm that requires no prefix extensions. Evaluation of Cacheflow showed a 97% cache-hit rate with a 10%-sized TCAM; the authors of Tree Caching do not provide experimental results but instead give a theoretical analysis of their algorithm. Both algorithms are suited to a data center's Software-Defined Networks with medium-sized FIBs.

IHARC [9] speeds up software IP lookup by using the CPU cache for route caching. It merges prefixes in the cache by selecting index bits that point to different cache sets with maximum mergeability. However, IHARC's architecture has several shortcomings. First, IHARC leverages a 3-level NART (Network Address Routing Table) data structure, with prefixes flattened to the lengths 8, 24, and 32; this significantly increases the size of the routing table and the consequent update churn. Second, the index selection routine is an expensive operation invoked at each routing update, and updates require IHARC to entirely invalidate its caches to maintain forwarding correctness, increasing cache-miss ratios. In [13], the authors improve IHARC with selective cache invalidations and reduced recomputation of index bits.

PFCA [14] leverages the programmable data plane to build a FIB caching architecture with two levels of cache. In this work, we enhance PFCA with a FIB aggregation module. Our novel approach, CFCA, achieves remarkably high cache-hit ratios with small cache sizes. While incrementally handling routing updates, CFCA significantly decreases the total FIB churn in TCAM compared to standard FIB caching and aggregation algorithms.

6 CONCLUSION
With the migration of computations into clouds, network operators face growing challenges in packet forwarding. Specifically, they need to ensure fast prefix lookups in FIBs while the volume of traffic and the size of the global routing table continue to grow. In this work, we presented CFCA, Combined FIB Caching and Aggregation, which uses two FIB compression techniques to offload unpopular routes from the expensive TCAM memory used for the FIB. Evaluation of CFCA on large traffic and BGP update traces shows remarkably low cache-miss ratios and FIB churn compared to existing FIB caching solutions. In particular, CFCA achieves up to 99.94% cache hits and stabilizes the cache by more than 46% with 2.5% of the entire FIB in the cache. Finally, compared to FIB aggregation algorithms, CFCA avoids large bursts of TCAM updates, reducing costly TCAM churn by 40% on average.


ACKNOWLEDGMENTS
We would like to thank our shepherd, Dr. Kartik Gopalan, and the reviewers for their helpful comments.

REFERENCES
[1] BGP routing table analysis reports. https://bgp.potaroo.net/. 2019.
[2] P4 language consortium. https://p4.org/. 2019.
[3] Portable switch architecture. https://p4.org/p4-spec/docs/PSA.html. 2019.
[4] Advanced Network Technology Center and University of Oregon. The RouteViews project. http://www.routeviews.org/. 2017.
[5] Marcin Bienkowski, Jan Marcinkowski, Maciej Pacut, Stefan Schmid, and Aleksandra Spyra. 2017. Online tree caching. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures. ACM, 329–338.
[6] Marcin Bienkowski, Nadi Sarrar, Stefan Schmid, and Steve Uhlig. 2018. Online aggregation of the forwarding information base: accounting for locality and churn. IEEE/ACM Transactions on Networking 26, 1 (2018), 591–604.
[7] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. 2014. P4: Programming protocol-independent packet processors. ACM SIGCOMM Computer Communication Review 44, 3 (2014), 87–95.
[8] Center for Applied Internet Data Analysis. http://www.caida.org/home/. 2019.
[9] Tzi-cker Chiueh and Prashant Pradhan. 2000. Cache memory design for network processors. In Proceedings of the Sixth International Symposium on High-Performance Computer Architecture. IEEE, 409–418.
[10] Richard P. Draves, Christopher King, Srinivasan Venkatachary, and Brian D. Zill. 1999. Constructing optimal IP routing tables. In Proceedings of IEEE INFOCOM 1999, Vol. 1. IEEE, 88–97.
[11] Kaustubh Gadkari, M. Lawrence Weikum, Dan Massey, and Christos Papadopoulos. 2016. Pragmatic router FIB caching. Computer Communications 84 (2016), 52–62.
[12] Geoff Huston. BGP in 2019 – the BGP table. https://blog.apnic.net/2020/01/14/bgp-in-2019-the-bgp-table/. 2020.
[13] Kartik Gopalan and Tzi-cker Chiueh. 2002. Improving route lookup performance using network processor cache. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing. IEEE, 22–22.
[14] Garegin Grigoryan and Yaoqing Liu. 2018. PFCA: a programmable FIB caching architecture. In Proceedings of the 2018 Symposium on Architectures for Networking and Communications Systems. ACM, 97–103.
[15] Garegin Grigoryan, Yaoqing Liu, Michael Leczinsky, and Jun Li. 2018. VeriTable: fast equivalence verification of multiple large forwarding tables. In Proceedings of IEEE INFOCOM 2018. IEEE, 621–629.
[16] Peng He, Wenyuan Zhang, Hongtao Guan, Kavé Salamatian, and Gaogang Xie. 2017. Partial order theory for fast TCAM updates. IEEE/ACM Transactions on Networking 26, 1 (2017), 217–230.
[17] Raj Jain and Shawn Routhier. 1986. Packet trains–measurements and a new model for computer network traffic. IEEE Journal on Selected Areas in Communications 4, 6 (1986), 986–995.
[18] Naga Katta, Omid Alipourfard, Jennifer Rexford, and David Walker. 2014. Infinite CacheFlow in software-defined networks. In Proceedings of the Third Workshop on Hot Topics in Software Defined Networking. ACM, 175–180.
[19] Naga Katta, Omid Alipourfard, Jennifer Rexford, and David Walker. 2016. CacheFlow: dependency-aware rule-caching for software-defined networks. In Proceedings of the Symposium on SDN Research. 1–12.
[20] Changhoon Kim, Matthew Caesar, Alexandre Gerber, and Jennifer Rexford. 2009. Revisiting route caching: the world should be flat. In Proceedings of the International Conference on Passive and Active Network Measurement. Springer, 3–12.
[21] Alex X. Liu, Chad R. Meiners, and Eric Torng. 2016. Packet classification using binary content addressable memory. IEEE/ACM Transactions on Networking 24, 3 (2016), 1295–1307.
[22] Yaoqing Liu and Garegin Grigoryan. 2018. Toward incremental FIB aggregation with quick selections (FAQS). In Proceedings of the 17th International Symposium on Network Computing and Applications. IEEE, 1–8.
[23] Yaoqing Liu, Vince Lehman, and Lan Wang. 2015. Efficient FIB caching using minimal non-overlapping prefixes. Computer Networks 83 (2015), 85–99.
[24] Yaoqing Liu, Beichuan Zhang, and Lan Wang. 2013. FIFA: fast incremental FIB aggregation. In Proceedings of IEEE INFOCOM 2013. IEEE, 1–9.
[25] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner. 2008. OpenFlow: enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review 38, 2 (2008), 69–74.
[26] Xiaoqiao Meng, Zhiguo Xu, Beichuan Zhang, Geoff Huston, Songwu Lu, and Lixia Zhang. 2005. IPv4 address allocation and the BGP routing table evolution. ACM SIGCOMM Computer Communication Review 35, 1 (2005), 71–80.
[27] Mehrdad Moradi, Feng Qian, Qiang Xu, Z. Morley Mao, Darrell Bethea, and Michael K. Reiter. 2015. Caesar: high-speed and memory-efficient forwarding engine for future internet architecture. In Proceedings of the Symposium on Architectures for Networking and Communications Systems. IEEE, 171–182.
[28] P. Lumbis, Y. Ramdoss, and D. Miller. CAT 6500 and 7600 Series Routers and Switches TCAM Allocation Adjustment Procedures. https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/117712-problemsolution-cat6500-00.html. 2014.
[29] Ori Rottenstreich and János Tapolcai. 2016. Optimal rule caching and lossy compression for longest prefix matching. IEEE/ACM Transactions on Networking 25, 2 (2016), 864–878.
[30] Nadi Sarrar, Steve Uhlig, Anja Feldmann, Rob Sherwood, and Xin Huang. 2012. Leveraging Zipf's law for traffic offloading. ACM SIGCOMM Computer Communication Review 42, 1 (2012), 16–22.
[31] Vibhaalakshmi Sivaraman, Srinivas Narayana, Ori Rottenstreich, S. Muthukrishnan, and Jennifer Rexford. 2017. Heavy-hitter detection entirely in the data plane. In Proceedings of the Symposium on SDN Research. ACM, 164–176.
[32] The P4 language consortium. Compiler for PSA. https://github.com/p4lang/behavioral-model/issues/873. 2020.
[33] The P4 language consortium. P4_16 language specification. https://p4lang.github.io/p4-spec/docs/P4-16-v1.0.0-spec.pdf. 2020.
[34] The P4 language consortium. The BMv2 simple switch target. https://github.com/p4lang/behavioral-model/blob/master/docs/simple_switch.md. 2020.
[35] Zahid Ullah, Manish K. Jaiswal, and Ray C. C. Cheung. 2015. Z-TCAM: an SRAM-based architecture for TCAM. IEEE Transactions on Very Large Scale Integration Systems 23, 2 (2015), 402–406.
[36] Jun Xu, Mukesh Singhal, and Joanne Degroat. 2000. A novel cache architecture to support layer-four packet classification at memory access speeds. In Proceedings of IEEE INFOCOM 2000, Vol. 3. IEEE, 1445–1454.
[37] Xiaoliang Zhao, Dante J. Pacella, and Jason Schiller. 2010. Routing scalability: an operator's view. IEEE Journal on Selected Areas in Communications 28, 8 (2010), 1262–1270.