dave allan, ericsson · a tutorial on the pruning algorithm from...

A Tutorial on the Pruning Algorithm from

draft‐allan‐spring‐mpls‐multicast‐framework‐02

Dave Allan, Ericsson

[email protected]

Overview

This document is intended as a companion to “A Framework for Computed Multicast applied to
MPLS based Segment Routing”, a.k.a. draft‐allan‐spring‐mpls‐multicast‐framework‐02, published at
www.ietf.org. Its purpose is to elaborate on the algorithm descriptions in that document in order to
assist the development of interoperable implementations.

Background

The origin of the concept is 802.1aq/RFC 6329 Shortest Path Bridging. The notion there was that a
separate signaling protocol for multicast could be dispensed with if multicast membership
information was disseminated in the IGP. 802.1aq incorporated a tie breaking algorithm that
permitted an (S,*) template to be established from which (S,G) trees could be derived.

A little over a year ago the suggestion was floated to apply the 802.1aq algorithm to SPRING:
with domain wide SIDs all the underlying pieces were there, the exercise would be
comparatively trivial, and the SPRING paradigm of a single control protocol could be preserved.

This starting point led to the realization that, as the dataplane included an a priori mesh of
unicast tunnels and a technology (label stacking) that made utilizing tunnels at least
superficially a trivial operation, it would be desirable to incorporate these into MDT
construction. The primary motivations were:

‐ Dataplane state reduction, both less state in the LIBs and less N‐S synchronization

‐ Leveraging unicast recovery for most failures

‐ Reuse of the MPLS dataplane

‐ Avoidance of some computational complexity: the use of tunnels means a computing node can
detect when it will not need to install state for a given tree, and therefore need not
process that tree to completion

A consequence of this was that simply using an (S,*) template tree would not work well with
ECMP, which serendipitously resulted in requiring minimum cost or near minimum cost trees,
adding bandwidth efficiency to the list of virtues of the approach.

The Problem Space

If we are to construct MDTs that have unicast tunnels that may be subject to ECMP treatment

as a component, we would like to avoid multiple copies of a multicast packet transiting the

same link. The following diagram illustrates the problem.

Figure 1

In Figure 1, node 9 is the root, and nodes 4, 13, 14, 15, 16 and 17 are leaves. In the example all
link weights are equal.

An MDT, illustrated by the green lines, has been constructed by some tie breaking mechanism
(in this case, the tie breaking rules from RFC 6329/802.1aq). This results in
replication points at nodes 2, 6 and 12 for this particular set of leaves. The red lines illustrate
where unicast tunnels could be employed, and highlight the problem: multiple copies of a
multicast packet could transit the same link in multiple places, namely on the adjacency
from 12 to 8, and on the adjacencies from 6 to 11 to 5. ECMP spreading of the traffic of the
tunnels from 6 to 13 and from 12 to 4 includes, in the set of possible outcomes, interfaces used
by other tunnels in the MDT. This is undesirable from the point of view of both efficiency and
implementation complexity.


Given the same topology and an MDT with the same root and leaves, the following is an
example of an MDT that is ECMP friendly:

Figure 2

To contrast the two trees, the second involves fewer nodes installing state, and fewer hops
traversed by copies of any given multicast packet (11 vs. 16).

There are a couple of things to observe in this.

First, if ECMP is not considered, we could probably live with the inefficiencies of an (S,*)
template tree, whereas generating ECMP friendly trees that can employ tunnels without any
duplication on interfaces requires a unique (S,G) solution.

Second, I would characterize the (S,G) solution as “near minimum cost”, as an ECMP friendly tree
is not necessarily of absolute minimum cost (although I do not have an example in my back pocket
of an ECMP friendly non‐minimum‐cost tree).

The Algorithm Concept


The algorithm is predicated on the notion that there are two classes of prunes and
simplification steps: those that, if repeated application fully resolves the tree, will
generate an ECMP friendly result, and those that are not guaranteed to produce an ECMP
friendly result. The former we will refer to as “safe” operations and the latter as “unsafe”. The
objective is to simplify the multicast tree to an acyclic tree that has only the root,
leaves and replication points as vertices. A different characterization of the two classes of
operation would be that “safe” decisions can be made considering only the state of a
single node, whereas a successful result from “unsafe” decisions depends on the coordination of
the decisions of multiple nodes.

If a given MDT is fully resolved, that is to say each leaf has a unique path to the root in the
simplified tree, as a consequence of the “safe” operations, then the MDT will be ECMP friendly.

If the MDT could not be fully resolved, then one or more “unsafe” operations are required, in the
form of a prune. Any prune will do. However, all entities performing the computation need to
have agreed “guess criteria” if they are to produce a common result. And as the guess is not
guaranteed to be the right guess, we need to partition the resolved leaves into two sets: those
resolved prior to any guesses and those resolved afterwards. The set of leaves resolved either
by or after a “guess operation” will need to be audited for “ECMP friendliness”. The algorithm
documented in this description includes a specific heuristic for making “unsafe” prunes such that
processing can be performed without having to consider the full matrix of intertwined decisions
simultaneously.

Order of Operations

Although a tree that is fully resolved by repeated application of “safe” operations will be
ECMP friendly, interoperability demands an agreed order of safe operations, as otherwise there
will be variations in the resulting MDTs. Some pruning operations have dependencies on
prior pruning results, therefore all computing nodes MUST perform the operations in the same
order. This is outlined below.

Initial Steps and Metrics

The very first step is to perform a shortest path computation, without any tie breaking, from
the root to all nodes in the network.

Then, for a given (S,G) tree, walk back the paths from the leaves to the root. With each link,
associate the leaves that would potentially be served by the link. The following diagram
illustrates this:

Figure 3

In this graph, links are of equal weight, and the distance from the root is marked beside each

node. Each leaf has been assigned a unique color. To take a single link from the example, it has

been determined that link 12‐8 COULD serve leaves 4, 13, 14 and 16.
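As a minimal sketch of these initial steps (my own illustration, not code from the draft), the following computes the distance of every node from the root and the set of potentially served leaves (PSL) for each link. It assumes equal link weights, so a breadth-first search stands in for the shortest path computation. The function name, the data layout and the five-node topology are all hypothetical.

```python
from collections import deque

def potentially_served_leaves(adj, root, leaves):
    """Return (distance of each node from root, {(upstream, downstream): leaves})."""
    # With equal link weights a breadth-first search yields shortest distances.
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    psl = {}
    for leaf in leaves:
        # Walk back from the leaf over every link that lies on some shortest
        # path (no tie breaking), tagging each such link with the leaf.
        seen = {leaf}
        stack = [leaf]
        while stack:
            v = stack.pop()
            for u in adj[v]:
                if dist[u] + 1 == dist[v]:
                    psl.setdefault((u, v), set()).add(leaf)
                    if u not in seen:
                        seen.add(u)
                        stack.append(u)
    return dist, psl

# Hypothetical topology: node 1 is the root, nodes 4 and 5 are leaves.
adj = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4, 5], 4: [2, 3], 5: [3]}
dist, psl = potentially_served_leaves(adj, root=1, leaves=[4, 5])
```

Any link that never receives an entry in `psl` (here the link between nodes 2 and 3, which are equidistant from the root) cannot serve any leaf and drops out of further consideration.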

At this point all links that cannot possibly serve any leaf in the MDT can be eliminated, which

leaves:


Figure 4

The preferred procedure to get a consistent result is to:

1) Perform an initial simplification of the graph. This involves the
elimination of all non‐candidate replication points and triangles.

2) Perform upstream pruning. This is performed in repeated passes where the nodes in each pass
are ranked initially by distance from the root and then by node ID at a common
distance, from lowest to highest in both cases. Intuitively, performing pruning
operations closer to the root first would simplify operations further from the root as
the upstream topology would be simplified first, and empirical evidence supports
this.

As prunes are performed they will result in further possible simplifications. These
are performed immediately. This not only optimizes performance, as searching for
simplifications is minimized via localization, it also minimizes the number of
iterations of upstream pruning required to either resolve the tree or determine that
“unsafe” operations are required. It should be noted though that a prune can result
in simplifications all the way back to the root, as potential simplifications are not
necessarily immediately adjacent to either the prune or any other resulting
simplifications. When performing upstream pruning for a given node, it is best to
mark all upstream adjacencies for pruning prior to checking for simplifications that
result from the prunes. This minimizes redundant intermediate simplification steps.

3) If an upstream pruning pass results in no prunes having been performed and the
MDT is not fully resolved, then a “guess” operation is required. A “guess” prune is
performed, followed by repeated iterations of “safe” upstream pruning until again no
further prunes are possible or the tree is resolved.

4) If a guess operation was required, all nodes still in the graph that transit leaves
resolved by or after the “guess” operation need to be checked for ECMP friendliness.
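The control flow of the four steps above can be sketched roughly as follows. This is only a skeleton under my own assumptions: the individual operations are passed in as hypothetical callables (`simplify`, `upstream_prune`, `guess_prune`, `is_resolved`, `audit`), the graph is any dict keyed by node ID, and `guess_prune` is assumed to always make progress.

```python
def resolve_mdt(graph, dist, simplify, upstream_prune, guess_prune, is_resolved, audit):
    """Drive steps 1 to 4; the callables are hypothetical hooks supplied by the caller."""
    simplify(graph)                               # step 1: initial simplification
    guessed = False
    while not is_resolved(graph):
        pruned = False
        # Step 2: one pass, ranked by distance from the root, then by node ID.
        for node in sorted(graph, key=lambda n: (dist[n], n)):
            # A node may have been simplified out earlier in this pass.
            if node in graph and upstream_prune(graph, node):
                simplify(graph)                   # simplify immediately after a prune
                pruned = True
        if not pruned and not is_resolved(graph):
            guess_prune(graph)                    # step 3: a single "unsafe" prune
            simplify(graph)
            guessed = True
    if guessed:
        audit(graph)                              # step 4: ECMP friendliness check
    return graph
```

The point of the sketch is the ordering: every computing node that walks the nodes in the same (distance, node ID) order and applies the same operations will arrive at the same tree.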

“Safe” prunes and simplifications

The “safe” prunes and simplifications will be fairly self evident once described. The key
concept that drives them is that we are seeking to simplify the MDT down to just the root,
replication points and leaves.

The following outlines the “safe” operations in more detail:

1) Elimination of non‐replicating nodes: Any node which is not a candidate replication point
can be simplified into being a link. Most simply, if the set of potentially served leaves is the
same on all interfaces, the node will not play a part in the final topology and can be replaced
with a link. The wrinkle is that if there are more than two equivalent links from the node to be
eliminated, this results in the substitution of multiple links into the topology. Consider the
following examples.

First the trivial case:

Node 2 is simply transit and can be eliminated, resulting in:


Second, a slightly more complicated case. Node 3 can be eliminated but this results in the
addition of multiple links to replace it:

With the following result:

Third, where the eliminated node has multiple equivalent interfaces both upstream and
downstream:

Eliminating node 3 requires a full mesh of links to connect the upstream to the downstream
peers, resulting in:


And finally it needs to be noted that a simple parallel links case may be the result of

simplifications, and that can be simplified further to a single link. In the example below,

nodes 2 and 3 are replaced by links:

Which leaves parallel paths between 1 and 4:

Which can again be simplified to a single link between them:

The overall guiding principle is “if it is something a tunnel will simply overlay, simplify it

out.”

Applying this to the previous network diagram we can replace nodes 1, 3 and 10 resulting in

the following:


Figure 5
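The elimination of a non‐replicating node can be sketched as below. This is an illustration under assumptions of my own: links are held in a dict keyed by (upstream, downstream) with the PSL set as the value, `protected` holds the root and leaves that must never be removed, and the node and leaf IDs in the usage are made up.

```python
def eliminate_transit_node(links, node, protected):
    """Replace `node` with links if it cannot be a replication point.

    links maps (upstream, downstream) -> set of potentially served leaves.
    Returns True if the node was removed.
    """
    if node in protected:                      # the root and the leaves always stay
        return False
    ups = [u for (u, v) in links if v == node]
    downs = [v for (u, v) in links if u == node]
    psl_sets = [links[(u, node)] for u in ups] + [links[(node, v)] for v in downs]
    if not ups or not downs or any(s != psl_sets[0] for s in psl_sets):
        return False                           # leaf sets differ: candidate replication point
    psl = psl_sets[0]
    for u in ups:
        del links[(u, node)]
    for v in downs:
        del links[(node, v)]
    # Full mesh between upstream and downstream peers. One dict key per node
    # pair means parallel links collapse into a single link automatically.
    for u in ups:
        for v in downs:
            links[(u, v)] = links.get((u, v), set()) | psl
    return True

# Parallel paths 1-2-4 and 1-3-4, all potentially serving hypothetical leaf 6.
links = {(1, 2): {6}, (1, 3): {6}, (2, 4): {6}, (3, 4): {6}}
eliminate_transit_node(links, 2, protected={6})
eliminate_transit_node(links, 3, protected={6})
# links is now {(1, 4): {6}}
```

Keying the dict on the node pair is a design shortcut: it folds the “parallel links become a single link” step into the node elimination itself, at the cost of not being able to represent genuinely distinct parallel links.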

Triangles

A triangle is a scenario whereby there are multiple possible paths between two nodes on the
shortest path. One of these nodes is closer to the root, and the other is further from the
root. One or more of the paths is a single link, and one or more of them has multiple hops,
implying candidate replication points. The single links can be eliminated.

The rationale is fairly simple: if any of the candidate replication points between the two
nodes of interest ultimately resolve to being actual replication points in the final tree, they
will be further from the root than the node of the pair that is closer to the root. And if they
do not become replication points, they will be simplified out to simply being a link anyway.
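A simple way to express the triangle rule in code, assuming the same hypothetical link dictionary keyed by (upstream, downstream), is to delete every direct link for which a multi‐hop alternative between the same two nodes exists. This is a sketch of the idea, not an optimized implementation.

```python
def prune_triangles(links):
    """Delete each direct link (a, b) for which a multi-hop path a -> ... -> b exists."""
    def reachable_without(a, b):
        # Depth-first search downstream from a, skipping the direct link a -> b.
        stack = [v for (u, v) in links if u == a and v != b]
        seen = set(stack)
        while stack:
            n = stack.pop()
            if n == b:
                return True
            for (u, v) in links:
                if u == n and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return False

    removed = [(a, b) for (a, b) in list(links) if reachable_without(a, b)]
    for link in removed:
        del links[link]
    return removed

# Node 1 reaches node 4 directly and via candidate replication point 3.
links = {(1, 3): {4, 5}, (3, 4): {4}, (3, 5): {5}, (1, 4): {4}}
removed = prune_triangles(links)   # the direct link (1, 4) is eliminated
```

Because the links all point away from the root, the graph is acyclic, so removing every such shortcut in one go does not disconnect any node from the root.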

In the following example we have a triangle


Which we can simplify by eliminating the link from 1 to 4, resulting in:

Applying this rule to our network diagram, we can eliminate links 12‐4, 6‐13, 8‐13, and 11‐13

resulting in:

Figure 6

Which then permits nodes 6 and 12 to be replaced with links, as they are no longer possible
replication points:


Figure 7

Which then permits link 9‐11 to be eliminated as a triangle:

Figure 8


Which then (in the interests of brevity) permits node 11 to be eliminated as simply a transit
node, and the resulting link 7‐5 to be eliminated as a triangle, which then leaves node 7 as a
transit node, which is replaced by link 9‐15.

Figure 9

At this point in the example, it is probably worth noting that 4, 14, 15 and 17 are “resolved” in
that they have a unique shortest path to the root. 13 and 16 are yet to be resolved.

Pruning of Upstream Links

At a high level, the pruning of upstream links is the elimination of possibilities that do not
make sense when the objective is a minimum cost tree. For each node in the current tree, the
pruning rule is to establish the closest upstream node that has to be in the topology, and
eliminate all links for which the closest such node is closer to the root.

The closest node that has to be in the topology can exist in two forms: one is a leaf, the other
is a pinned path. A node that transits a unique path to the root, or is the closest
point to the root downstream of which unique connectivity to a leaf exists, is a component that
has to be in the final MDT, so is considered to be equivalent to a leaf.

The actual precedence given leaves or pinned paths equidistant towards the root is:


‐ A leaf is preferred over a pinned path. When a tie occurs between a leaf and a
pinned path, the adjacency to the pinned path is pruned.

‐ Where an adjacency has a closer candidate replication point, and the leaf or pinned path
upstream of that replication point is at equal distance to the leaf or pinned path directly
connected to the “best” upstream adjacency, the node ID of the leaf or pinned path upstream
of the replication point will be included in any ranking decision of the upstream tie. The
distinction is that the adjacency with the candidate replication point would not be
eliminated as a result of an inferior ranking.

Figure 10

Figure 10 illustrates the precedence relationships. Nodes 0, 2, 3, 4, 8, 12 and 17 are leaves.

Nodes 7, 10, 13 and 16 are candidate replication points. Pinned paths transit 7 (to leaf 0), 13 (to

leaf 12) and 16 (to leaf 17).

When pruning from 2:

1. Adjacency 2‐16 can be deleted, as node 16 is closer to the root than the closest pinned
components to the current node (in this case the pinned paths via 13 and 7, and leaves 4 and
8).

2. Adjacencies 2‐7 and 2‐13 are of equal class in that both are pinned nodes due to unique
paths to 0 and 12 respectively. As we currently keep the adjacency to the lowest node
ID when all other things are equal, we would delete 2‐13.

3. Adjacencies 2‐10 and 2‐8 are of equal class, and superior to adjacencies 2‐7 and 2‐13
(leaves being superior to pinned paths). Therefore adjacencies 2‐7 and 2‐13 would be
pruned.

4. Adjacency 2‐10 is considered superior to adjacency 2‐8, as nodes 4 and 8 are of equal
class (both leaves), but 4 is the lower ID number.

So if all the nodes and adjacencies in the above diagram were present, according to the pruning

rules, this would be pruned down to…

Figure 12

Note that any adjacency that has a candidate node closer to the pruning node than any pinned
component will NOT be pruned. For example, if the node IDs 4 and 8 were reversed, adjacency
2‐10‐8 would not be deleted despite 2‐4 having a lower node ID, as there is a possibility that
subsequent prunes may change 10’s role, making 2‐10 a superior adjacency to 2‐4.
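My reading of the ranking rules can be condensed into the following sketch. It is an interpretation rather than a normative statement: each upstream adjacency of the pruning node is summarized by a hypothetical `Upstream` record, and the distances from the root used in the example (1 for node 16, 2 for the others) are invented, since only their relative order matters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Upstream:
    neighbor: int        # adjacent upstream node
    kind: str            # "leaf" or "pinned": class of the closest pinned component
    comp_id: int         # node ID of that component
    comp_dist: int       # distance of that component from the root
    via_candidate: bool  # a candidate replication point sits in front of it

def upstream_prunes(adjacencies):
    """Return the neighbors whose adjacency is pruned under the rules above."""
    def rank(a):
        # Components further from the root (closer to the pruning node) first,
        # then leaf before pinned path, then lowest node ID.
        return (-a.comp_dist, 0 if a.kind == "leaf" else 1, a.comp_id)
    best = min(adjacencies, key=rank)
    # An adjacency fronted by a candidate replication point is never pruned
    # merely because it ranked lower.
    return sorted(a.neighbor for a in adjacencies
                  if a is not best and not a.via_candidate)

# Pruning from node 2 in Figure 10 (distances are illustrative only).
fig10 = [
    Upstream(16, "pinned", 16, 1, False),
    Upstream(7, "pinned", 7, 2, False),
    Upstream(13, "pinned", 13, 2, False),
    Upstream(10, "leaf", 4, 2, True),    # leaf 4 reached via candidate node 10
    Upstream(8, "leaf", 8, 2, False),
]
pruned = upstream_prunes(fig10)          # [7, 8, 13, 16]: only 2-10 survives
```

With the IDs of leaves 4 and 8 reversed, the direct adjacency to leaf 4 ranks best, but 2‐10 is still retained because of the candidate replication point at 10.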

Applying this to our network example, both nodes 13 and 16 have choices to make. 13 has equal
choices because both nodes 2 and 5 are on pinned paths: 2 is pinned by traffic to node 14, and
5 is pinned by traffic to node 17. 16 has unequal choices, either via 4 (a leaf) or via 2 (a
pinned path), where the path via 4 will be preferred.

Figure 13

The further simplification of eliminating node 5 as being purely transit provides the
resulting MDT; all leaves have a unique path to the root.

“unsafe” pruning

If repeated application of “safe” operations has not completely resolved the MDT, it will be
because there are prunes that require the coordination of the decisions of multiple nodes. The
strategy to deal with this is to make a single prune based upon a heuristic that will have a high
probability of multiple nodes making a common decision, and then revert to “safe” operations
to resolve the tree. This becomes a “rinse and repeat” operation. The premise is that if the
“guess” was correct, all subsequent “safe” operations will be consistent with it and we will end
up with an ECMP friendly tree that requires no corrections.

Prior to making the first “unsafe” prune, we need to identify the leaves that have already been

fully resolved. They will require no further consideration. All leaves that are resolved either by

the first “unsafe” prune or subsequent prunes and simplifications will need to be checked for

ECMP friendliness.

The “unsafe” prune utilized was to select the node closest to the root that had multiple

upstream adjacencies. The heuristic for selecting which upstream path to keep was to select

the path with the densest count of potentially served leaves (known as the PSL count). As the

immediate adjacencies will all have an equal PSL count, this involves walking backwards

towards the root until the closest point with the highest PSL count is found.
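A rough sketch of this heuristic follows. Several details are assumptions on my part: when a node on the walk has more than one upstream link the walk follows the one with the highest PSL count, ties on count go to the point closest to the pruning node, and remaining ties go to the lowest node ID. The topology, distances and counts are invented for illustration, loosely modelled on the example that follows.

```python
def choose_guess_prune(dist, up, psl_count):
    """Pick the "unsafe" prune: returns (node, kept_upstream, pruned_upstreams)."""
    # Closest node to the root with more than one upstream adjacency.
    node = min((n for n in up if len(up[n]) > 1), key=lambda n: (dist[n], n))

    def score(first_hop):
        # Walk towards the root until the PSL count rises above that of the
        # immediate adjacency, or the root is reached.
        base = psl_count[(first_hop, node)]
        below, here = node, first_hop
        while up.get(here) and psl_count[(here, below)] <= base:
            below, here = here, max(up[here], key=lambda u: psl_count[(u, here)])
        count = psl_count[(here, below)]
        # Highest count wins; ties go to the point closest to `node`, that is
        # the one furthest from the root, then to the lowest node ID.
        return (-count, -dist[below], first_hop)

    ranked = sorted(up[node], key=score)
    return node, ranked[0], ranked[1:]

# Hypothetical inputs: distance from the root, upstream neighbors, PSL counts.
dist = {"root": 0, 0: 1, 7: 1, 6: 2, 9: 2, 11: 3, 10: 4, 8: 5}
up = {"root": [], 0: ["root"], 6: ["root"], 7: ["root"], 9: [0],
      11: [7], 10: [0, 11], 8: [6, 11]}
psl_count = {("root", 0): 2, ("root", 6): 1, ("root", 7): 2, (7, 11): 2,
             (0, 9): 1, (0, 10): 1, (11, 10): 1, (11, 8): 1, (6, 8): 1}
choice = choose_guess_prune(dist, up, psl_count)   # (10, 11, [0])
```

Whatever the exact tie breaks, the essential property is that every computing node evaluates the same deterministic function over the same link state, so all of them make the same guess.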

Figure 14

Figure 14 is an example of an MDT that could not be resolved purely by “safe” operations. Both

nodes 10 and 8 have upstream links where a “potential replication point” is closer than any

pinned component. The diagram is drawn to illustrate the relative metrics so, for example,

node 0 is closer to the root and further from node 10 than node 11 is.

The closest unresolved node is 10. Walking back both upstream adjacencies, the next

adjacencies (0 to root, and 11 to 7) both have PSL counts of 2. But node 11 being closer to node

10 will inform the decision to prune the link 0‐10, resulting in:


Figure 15

Similar to “safe” pruning, any consequent simplifications of the graph need to be performed. In

this particular example there were not any.

At this point the application of “safe” operations resumes. The view from 8 is that there
is a pinned path at 11, therefore all adjacencies with components closer to the root than 11 can
be pruned, eliminating the link from 8 to 6. At this point all leaves have a unique path to the
root.

Figure 16

Tree Audit

In the previous example, leaves 8 and 10 were not resolved by the initial set of “safe”
operations, therefore they and all intermediate nodes they transit need to be checked for
“ECMP friendliness”. In that example there is not a problem, as on the shortest path node 11 is
closer to 8 and 10 than any other node in the tree.

We could also consider the situation illustrated below, whereby for whatever reason we ended

up with this as the MDT, and this included resolving node 12 after an “unsafe” prune was

performed. We will postulate for this example that leaves 0, 2 and 9 were resolved by “safe”

rules, and therefore do not need checking.

Figure 17

In this case, there is an MDT component (node 9) that is both closer to 12 than its first
upstream component (node 8) and is downstream of node 8. The issue with this arrangement
is that a tunnel to 12 could go via node 9, implying that two copies of a packet from the root
would transit 8‐9.

Therefore the tree needs to be modified. 12 is connected to the closer node (node 9), and the
adjacency to 8 is eliminated. This also results in the further simplification that 8, which is no
longer a replication node, can be replaced with a link. The corrected tree is:

Figure 18
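One way to sketch the audit is to check, for each tunnel from an MDT component to its downstream component, whether any other MDT component lies on a shortest path between the two. The following assumes equal link weights (so hop counts serve as distances) and a connected network. The physical topology, including root node 1 and transit node 20, is hypothetical; only the relationship between nodes 8, 9 and 12 mirrors the example.

```python
from collections import deque

def hop_distances(adj, src):
    """Breadth-first hop counts from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def ecmp_violations(adj, mdt_parent, audited):
    """Return (parent, child, intruder) triples that break ECMP friendliness."""
    components = set(mdt_parent) | set(mdt_parent.values())
    found = []
    for child in audited:
        parent = mdt_parent[child]
        from_parent = hop_distances(adj, parent)
        to_child = hop_distances(adj, child)
        for x in components - {parent, child}:
            # x lies on some shortest path of the tunnel parent -> child.
            if from_parent[x] + to_child[x] == from_parent[child]:
                found.append((parent, child, x))
    return sorted(found)

# Physical links: two equal cost paths from 8 to 12, via 9 and via 20.
adj = {1: [8], 8: [1, 9, 20], 9: [8, 12], 20: [8, 12], 12: [9, 20]}
before = ecmp_violations(adj, {8: 1, 9: 8, 12: 8}, audited=[12])  # [(8, 12, 9)]
after = ecmp_violations(adj, {8: 1, 9: 8, 12: 9}, audited=[12])   # []
```

The reported intruder is the node the audited component should be re‐homed to, which is the correction described above.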

“Early exit” when computing

The big optimization in any implementation is ultimately going to be how little the code has to
compute. This means that when the computing node knows what its role is with respect to a given
MDT, which includes having no role, it can cease processing. For example, a node computing a
given (S,G) tree can terminate processing as soon as it is itself removed from the current
intermediate product of determining that tree. It will not have a role, nor will it need to
install state, so the node is free to move on to the next G. Similarly, if all leaves that are
downstream of the computing node have been resolved by “safe” operations, the node is done, as
it has sufficient knowledge to determine what state it needs to install. It does not need to
fully resolve the entire tree; it can tell when what it needs to know has been determined.

Software Implementation

I won’t say a lot about implementation, except that the above already includes some specific
requirements that also happen to provide a degree of optimization. The key thing IMO is that the
description should scream “bitmaps” at you. Keeping track of potentially served leaves, pinned
paths and fully resolved leaves is most easily done with bitmap operations. So I do tend to think
of this as “BIER in the control plane”.
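To make the bitmap point concrete, here is a small sketch using plain Python integers as bitmaps. The bit assignment (one bit per leaf, in list order) and the helper names are my own choices; the leaf IDs and the PSL set of link 12‐8 are taken from the network example earlier in this document.

```python
def leaf_bitmap(leaves, index):
    """Pack a collection of leaf IDs into an integer bitmap."""
    bits = 0
    for leaf in leaves:
        bits |= 1 << index[leaf]
    return bits

def done_early(downstream_psl, resolved_safely):
    """True when every leaf downstream of this node is already resolved."""
    return downstream_psl & ~resolved_safely == 0

# One bit per leaf of the example MDT.
leaves = [4, 13, 14, 15, 16, 17]
index = {leaf: bit for bit, leaf in enumerate(leaves)}

psl_12_8 = leaf_bitmap([4, 13, 14, 16], index)   # leaves link 12-8 could serve
psl_count = bin(psl_12_8).count("1")             # PSL count of that link: 4
resolved = leaf_bitmap([4, 14, 15, 17], index)   # leaves resolved so far

same_psl = psl_12_8 == leaf_bitmap([16, 14, 13, 4], index)  # "same on all interfaces" test
can_stop = done_early(leaf_bitmap([14, 17], index), resolved)
```

Set equality (the non‐replicating node test), union (merging parallel links) and the early exit check each become a single integer operation.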

Dataplane Implementation

At a top level, the technique assumes that an MPLS forwarding engine can resolve a single
multicast SID label into multiple next hops, each of which may be represented by multiple
NHLFEs which then need further resolution via ECMP processing. This may be beyond existing
implementations. In that case, the control plane can explicitly select the set of NHLFEs, with
no ECMP processing for the first hop. This is a nodal behavior that has no effect on the
surrounding nodes in the MDT. The only consequence is that the load spreading of the first hop
of the MDT past a node implementing this option may be diminished.

Terms and Acronyms

ECMP – equal cost multipath

ECMP friendly – an MDT where the closest replication point to a leaf has no closer leaves or replication points on the shortest path between them

MDT – multicast distribution tree

PSL – potentially served leaf

S,G – an MDT for G rooted at S

S,* – a tree from S to all other nodes in the network
