A Tutorial on the Pruning Algorithm from
draft‐allan‐spring‐mpls‐multicast‐framework‐02
Dave Allan, Ericsson
Overview
This document is intended as a companion to “A Framework for Computed Multicast applied to
MPLS based Segment Routing”, a.k.a. draft‐allan‐spring‐mpls‐multicast‐framework‐02, published at
www.ietf.org. Its purpose is to elaborate on the algorithm descriptions in that document in order to
assist the development of interoperable implementations.
Background
The origin of the concept was 802.1aq/RFC 6329 Shortest Path Bridging. The notion was that a
separate signaling protocol for multicast could be dispensed with and multicast membership
information disseminated in the IGP instead. 802.1aq incorporated a tie breaking algorithm that
permitted an (S,*) template to be established from which (S,G) trees could be derived.
A little over a year ago the suggestion was floated to apply the 802.1aq algorithm to SPRING.
With domain wide SIDs all the underlying pieces were there, the exercise would be
comparatively trivial, and the SPRING paradigm of a single control protocol could be preserved.
This starting point led to the realization that, as the dataplane included an a priori mesh of
unicast tunnels and a technology (label stacking) that made utilizing tunnels at least
superficially a trivial operation, it would be desirable to incorporate these into MDT
construction. The primary motivations were:
‐ Dataplane state reduction, both less state in the LIBs as well as less N‐S synchronization
‐ Leverage unicast recovery for most failures
‐ Reuse of the MPLS dataplane
‐ The use of tunnels meant some computational complexity could be avoided by
detecting when the computing node would not need to install state for a given tree, and
therefore not processing it to completion
A consequence of this was that simply using an (S,*) template tree would not work well with
ECMP. Serendipitously, this resulted in a requirement for minimum cost or near minimum cost
trees, which adds bandwidth efficiency to the list of virtues of the approach.
The Problem Space
If we are to construct MDTs that have unicast tunnels that may be subject to ECMP treatment
as a component, we would like to avoid multiple copies of a multicast packet transiting the
same link. The following diagram illustrates the problem.
Figure 1
In figure 1, node 9 is the root, and nodes 4, 13, 14, 15, 16 and 17 are leaves. In the example all
link weights are equal.
The MDT illustrated via the green lines has been constructed by some tie breaking mechanism
(in this case, using the tie breaking rules from RFC 6329/802.1aq). This results in
replication points at nodes 2, 6 and 12 for this particular set of leaves. The red lines illustrate
where unicast tunnels could be employed, and highlight the problem, which is that multiple
copies of a multicast packet could transit the same link in several places: on the adjacency
from 12 to 8, and on the adjacencies from 6 to 11 to 5. The set of possible outcomes of ECMP
spreading of traffic in the tunnels from 6 to 13 and from 12 to 4 includes interfaces used by
other tunnels in the MDT. This is undesirable from the point of view of both efficiency and
implementation complexity.
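The overlap described above can be detected mechanically. The following is a minimal sketch, assuming a small hypothetical equal weight topology (it is not the topology of Figure 1) and function names invented for illustration. It collects every link that lies on any equal cost path of each tunnel and reports the links that two tunnels of the same MDT could both use.

```python
from collections import deque
from itertools import combinations

# Hypothetical equal weight topology, not the one in Figure 1.
LINKS = [("R", "A"), ("R", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("C", "D")]

def neighbours(links):
    """Build an undirected adjacency map from a list of links."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def hop_distances(adj, src):
    """Breadth first search; valid because all link weights are equal."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def ecmp_links(adj, src, dst):
    """Directed links that lie on at least one shortest path from src to dst."""
    from_src = hop_distances(adj, src)
    to_dst = hop_distances(adj, dst)
    total = from_src[dst]
    return {(u, v) for u in adj for v in adj[u]
            if from_src[u] + 1 + to_dst[v] == total}

def shared_links(adj, tunnels):
    """Links that two different tunnels of the same MDT could both use."""
    shared = set()
    for t1, t2 in combinations(tunnels, 2):
        shared |= ecmp_links(adj, *t1) & ecmp_links(adj, *t2)
    return shared

overlap = shared_links(neighbours(LINKS), [("R", "C"), ("R", "A")])
```

In this toy case the tunnel from R to C may be spread over A or B, so it can collide with the tunnel from R to A on link R to A, and `overlap` contains exactly that link.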
Given the same topology and an MDT with the same root and leaves, the following is an
example of an MDT that is ECMP friendly:
Figure 2
To contrast the two trees, the second involves fewer nodes installing state, and fewer hops
traversed by copies of any given multicast packet (11 vs. 16).
There are a couple of things to observe in this.
First, if ECMP is not considered, we could probably live with the inefficiencies of an (S,*)
template tree whereas generating ECMP friendly trees that can employ tunnels without any
duplication on interfaces requires a unique (S,G) solution.
Second, I would characterize the (S,G) solution as “near minimum cost”, as ECMP friendly is not
necessarily absolute minimum cost (although I do not have an example in my back pocket of an
ECMP friendly non‐minimum‐cost tree).
The Algorithm Concept
The algorithm is predicated on the notion that there are two classes of prunes and
simplification steps: those that, if repeated application fully resolves the tree, are
guaranteed to generate an ECMP friendly result, and those that are not guaranteed to produce
an ECMP friendly result. The former we will refer to as “safe” operations and the latter as
“unsafe”. The objective is to simplify the multicast tree to an acyclic tree that has only the
root, leaves and replication points as vertices. A different characterization of the two classes
of operation would be that “safe” decisions can be made considering only the state of a
single node, whereas a successful result from “unsafe” decisions depends on the coordination
of the decisions of multiple nodes.
If a given MDT is fully resolved as a consequence of the “safe” operations, that is to say each
leaf has a unique path to the root in the simplified tree, then the MDT will be ECMP friendly.
If the MDT could not be fully resolved, then one or more “unsafe” operations are required, in the
form of a prune. Any prune will do. However, all entities performing the computation need to
have agreed “guess criteria” if they are to produce a common result. And as the guess is not
guaranteed to be the right guess, we need to partition the resolved leaves into two sets: those
resolved prior to any guesses and those resolved afterwards. The set of leaves resolved either
by or after a “guess operation” will need to be audited for “ECMP friendliness”. The algorithm
as documented includes a specific heuristic for making “unsafe” prunes such that
processing can be performed without having to consider the full matrix of intertwined decisions
simultaneously.
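The test for a resolved leaf can be expressed directly as a path count. The following is a minimal sketch, assuming the simplified graph is held as a hypothetical map from each node to its set of upstream neighbours; the function and variable names are illustrative and do not come from the draft.

```python
from functools import lru_cache

def resolved_leaves(upstream, root, leaves):
    """Return the leaves that have exactly one path to the root.

    upstream maps each node to the set of its upstream neighbours in the
    simplified (acyclic) graph.
    """
    @lru_cache(maxsize=None)
    def path_count(node):
        if node == root:
            return 1
        return sum(path_count(up) for up in upstream.get(node, ()))

    return {leaf for leaf in leaves if path_count(leaf) == 1}

# Hypothetical graph: leaf "x" has one path to the root, leaf "y" still has two.
example = {"a": {"r"}, "b": {"r"}, "x": {"a"}, "y": {"a", "b"}}
before_guess = resolved_leaves(example, "r", {"x", "y"})
```

Taking this snapshot before the first guess gives the partition the text calls for: any leaf that is resolved later and is not in `before_guess` belongs to the set that needs auditing.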
Order of Operations
Although a tree that is fully resolved by repeated application of “safe” operations will be
ECMP friendly, interoperability demands an agreed order of safe operations, as otherwise there
will be variations in the resulting MDTs. Some pruning operations depend on
prior pruning results, therefore all computing nodes MUST perform the operations in the same
order. This is outlined below.
Initial Steps and Metrics
The very first step is to perform a shortest path computation, without any tie breaking, from
the root to all nodes in the network.
Then, for a given (S,G) tree, walk back the paths from the leaves to the root. With each link
associate the leaves that could potentially be served by the link. The following diagram
illustrates this:
Figure 3
In this graph, links are of equal weight, and the distance from the root is marked beside each
node. Each leaf has been assigned a unique color. To take a single link from the example, it has
been determined that link 12‐8 COULD serve leaves 4, 13, 14 and 16.
At this point all links that cannot possibly serve any leaf in the MDT can be eliminated, which
leaves:
Figure 4
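The initial steps can be sketched as follows. This is a minimal illustration under stated assumptions: the graph is a hypothetical map of node to neighbour to link weight, every node is reachable from the root, and the function names are invented here. It runs a shortest path computation with no tie breaking, walks back from each leaf over every equal cost upstream link, and records the potentially served leaves per link. Links that never appear in the result cannot serve any leaf and are thereby eliminated.

```python
import heapq

def build(links):
    """Symmetric graph: node -> {neighbour: link weight}."""
    graph = {}
    for u, v, w in links:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

def shortest_distances(graph, root):
    """Dijkstra from the root; equal cost alternatives are all retained later."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def potentially_served_leaves(graph, root, leaves):
    """Map each shortest path link (upstream, downstream) to its PSL set."""
    dist = shortest_distances(graph, root)
    psl = {}
    for leaf in leaves:
        stack, seen = [leaf], {leaf}
        while stack:                        # walk back from the leaf to the root
            v = stack.pop()
            for u, w in graph[v].items():
                if dist[u] + w == dist[v]:  # u is upstream of v on a shortest path
                    psl.setdefault((u, v), set()).add(leaf)
                    if u not in seen:
                        seen.add(u)
                        stack.append(u)
    return psl

# Hypothetical topology: root 9, leaves 3 and 4, all link weights 1.
graph = build([(9, 1, 1), (9, 2, 1), (1, 3, 1), (2, 3, 1), (2, 4, 1), (3, 4, 1)])
psl = potentially_served_leaves(graph, 9, [3, 4])
```

Here link 9 to 2 could serve both leaves, while the link between 3 and 4 is on no shortest path from the root and so does not appear in `psl` at all.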
The preferred procedure to get a consistent result is to:
1) Perform an initial simplification of the graph. This involves the
elimination of all non‐candidate replication points and triangles.
2) Perform upstream pruning. This is performed in repeated passes where each pass is
ranked initially by distance from the root and then by node ID at a common
distance, from lowest to highest in both cases. Intuitively performing pruning
operations closer to the root first would simplify operations further from the root as
the upstream topology would be simplified first, and empirical evidence supports
this.
As prunes are performed they will result in further possible simplifications. These
are performed immediately. This not only optimizes performance, as searching for
simplifications is minimized via localization, it also minimizes the number of
iterations of upstream pruning required to either resolve the tree or determine that
“unsafe” operations are required. It should be noted though that a prune can result
in simplifications all the way back to the root, as potential simplifications are not
necessarily immediately adjacent to either the prune or any other resulting
simplifications. When performing upstream pruning for a given node, it is best to
mark all upstream adjacencies for pruning prior to checking for simplifications that
result from the prunes. This minimizes redundant intermediate simplification steps.
3) If an upstream pruning pass results in no prunes having been performed and the
MDT is not fully resolved, then a “guess” operation is required. A “guess” prune is
performed followed by repeated iterations of “safe” upstream pruning until again no
further prunes are possible or the tree is resolved.
4) If a guess operation was required, all nodes still in the graph that transit leaves
resolved by or after the “guess” operation need to be checked for ECMP friendliness.
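The control flow of steps 1 to 4 can be sketched as a small driver loop. This is only an outline under stated assumptions: the four callables are hypothetical placeholders for the operations described in this document, `state` is whatever structure an implementation uses for the intermediate graph, and the sketch assumes every guess makes progress.

```python
def pass_order(dist):
    """Nodes ranked by distance from the root, then by node ID, lowest first."""
    return sorted(dist, key=lambda node: (dist[node], node))

def resolve(state, leaves, simplify, upstream_pass, guess_prune, resolved):
    """Run steps 1 to 4 and return the leaves that still need an audit.

    simplify(state)       step 1, initial simplification
    upstream_pass(state)  one pass of step 2, returns True if anything was pruned
    guess_prune(state)    step 3, a single "unsafe" prune
    resolved(state)       set of leaves that currently have a unique path
    """
    simplify(state)
    safe = None                         # leaves resolved before any guess
    while True:
        while upstream_pass(state):     # repeat until a pass prunes nothing
            pass
        done = resolved(state)
        if safe is None:
            safe = set(done)
        if done >= set(leaves):
            return done - safe          # step 4: only these need checking
        guess_prune(state)
```

If the tree resolves without any guess, the returned set is empty and no audit is needed.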
“Safe” prunes and simplifications
The “safe” prunes and simplifications will be fairly self‐evident once described. The key
concept that drives them is that we are seeking to simplify the MDT down to just the root,
replication points and leaves.
The following outlines the “safe” operations in more detail:
1) Elimination of non‐replicating nodes: Any node which is not a candidate replication point
can be simplified into being a link. Most simply, if the set of potentially served leaves is the
same on all interfaces, the node will not play into the final topology and can be replaced
with a link. The wrinkle is that if there are more than two equivalent links from the node to be
eliminated, it results in the substitution of multiple links into the topology. Consider the
following examples.
First the trivial case:
Node 2 is simply transit and can be eliminated, resulting in:
Second a slightly more complicated case. Node 3 can be eliminated but this results in the
addition of multiple links to replace it,
With the following result
Third, where the eliminated node has multiple equivalent interfaces both upstream and
down…
Eliminating node 3 requires a full mesh of links to connect the upstream to the downstream
peers, resulting in:
And finally it needs to be noted that a simple parallel links case may be the result of
simplifications, and that can be simplified further to a single link. In the example below,
nodes 2 and 3 are replaced by links:
Which leaves parallel paths between 1 and 4:
Which can again be simplified to a single link between them:
The overall guiding principle is “if it is something a tunnel will simply overlay, simplify it
out.”
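The elimination of a non‐replicating node can be sketched as follows. This is a minimal illustration, assuming the graph is held as a hypothetical map from (upstream, downstream) link to its set of potentially served leaves; the function name is invented here. A node whose interfaces all carry the same set is replaced by a full mesh of links between its upstream and downstream peers, and keying the map by node pair collapses any resulting parallel links into one.

```python
def eliminate_transit(links, node):
    """Replace `node` with links if it cannot be a replication point.

    links maps (upstream, downstream) -> set of potentially served leaves
    and is modified in place. Returns True if the node was eliminated.
    """
    ups = [u for (u, d) in links if d == node]
    downs = [d for (u, d) in links if u == node]
    psls = [links[(u, node)] for u in ups] + [links[(node, d)] for d in downs]
    if not ups or not downs or any(p != psls[0] for p in psls):
        return False        # an end point, or a candidate replication point
    for u in ups:
        for d in downs:
            # Full mesh of upstream to downstream peers. One entry per node
            # pair means parallel links collapse into a single link.
            links.setdefault((u, d), set()).update(psls[0])
    for u in ups:
        del links[(u, node)]
    for d in downs:
        del links[(node, d)]
    return True

# The parallel case from the text: nodes 2 and 3 both sit between 1 and 4.
# Leaf 9 is a hypothetical leaf somewhere downstream of node 4.
links = {(1, 2): {9}, (1, 3): {9}, (2, 4): {9}, (3, 4): {9}}
eliminate_transit(links, 2)
eliminate_transit(links, 3)
```

After both calls only the single link from 1 to 4 remains. A node whose downstream interfaces serve different leaves fails the equality test and is left in place as a candidate replication point.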
Applying this to the previous network diagram we can replace nodes 1, 3 and 10 resulting in
the following:
Figure 5
Triangles
A triangle is a scenario whereby between two nodes on the shortest path there are multiple
possible paths. One of these nodes is closer to the root, and the other is further from the
root. One or more of the paths is a single link, and one or more of them has multiple hops,
implying candidate replication points. The single links can be eliminated.
The rationale is fairly simple: if any of the candidate replication points between the two
nodes of interest ultimately resolve to being actual replication points in the final tree, they
will be further from the root than the node of the pair that is closer to the root. And if they
do not result in replication points, they will be simplified out to simply being a link anyway.
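Triangle elimination can be sketched as a reachability test. This is a minimal illustration, assuming the graph is a hypothetical set of (upstream, downstream) pairs with transit nodes already simplified out, so that any alternative path necessarily has multiple hops through candidate replication points. The function name is invented here.

```python
def eliminate_triangles(links):
    """Remove direct links that are paralleled by a multi-hop path.

    links is a set of (upstream, downstream) pairs forming an acyclic graph
    and is modified in place. Returns the set of links removed.
    """
    downstream = {}
    for u, d in links:
        downstream.setdefault(u, set()).add(d)

    def reachable(src, dst, skip):
        """True if dst can be reached from src without using link `skip`."""
        stack, seen = [src], {src}
        while stack:
            node = stack.pop()
            for nxt in downstream.get(node, ()):
                if (node, nxt) == skip or nxt in seen:
                    continue
                if nxt == dst:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    removed = {(u, d) for (u, d) in links if reachable(u, d, skip=(u, d))}
    links.difference_update(removed)
    return removed

# A triangle between nodes 1 and 4 with node 3 as the intermediate hop.
triangle = {(1, 3), (3, 4), (1, 4)}
removed = eliminate_triangles(triangle)
```

The direct link from 1 to 4 is removed and the two hop path via 3 is kept.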
In the following example we have a triangle
Which we can simplify by eliminating the link from 1 to 4, resulting in:
Applying this rule to our network diagram, we can eliminate links 12‐4, 6‐13, 8‐13, and 11‐13
resulting in:
Figure 6
Which then permits nodes 6 and 12 to be replaced with links as they are no longer possible
replication points:
Figure 7
Which then permits link 9‐11 to be eliminated as a triangle:
Figure 8
Which then (in the interests of brevity) permits node 11 to be eliminated as simply a transit
node, and the resulting link 7‐5 to be eliminated as a triangle, which then leaves node 7 as a
transit node, which is replaced by link 9‐15.
Figure 9
At this point in the example, it is probably worth noting that 4, 14, 15 and 17 are “resolved” in
that they have a unique shortest path to the root. 13 and 16 are yet to be resolved.
Pruning of Upstream Links
The pruning of upstream links at a high level is the elimination of possibilities that do not make
sense when the objective is a minimum cost tree. For each node in the current tree, the pruning
rule is to establish the closest upstream node that has to be in the topology, and eliminate all
links where the closest node is closer to the root.
The closest node that has to be in the topology can exist in two forms: one is a leaf, the other is
a pinned path. A node that transits a unique path to the root, or that is the closest
point to the root downstream of which unique connectivity to a leaf exists, is a component that
has to be in the final MDT, so is considered to be equal to a leaf.
The actual precedence given to leaves or pinned paths equidistant towards the root is:
‐ A leaf is preferred over a pinned path. When a tie occurs between a leaf and a
pinned path, the adjacency to the pinned path is pruned.
‐ Where an adjacency leads to a closer candidate replication point, and the leaf or pinned
path upstream of that replication point is at the same distance as the leaf or pinned path on
the “best” directly connected upstream adjacency, the node ID of the leaf or pinned path
upstream of the replication point is included in any ranking decision for the upstream
tie. The distinction is that the adjacency with the candidate replication point is not
eliminated as a result of an inferior ranking.
Figure 10
Figure 10 illustrates the precedence relationships. Nodes 0, 2, 3, 4, 8, 12 and 17 are leaves.
Nodes 7, 10, 13 and 16 are candidate replication points. Pinned paths transit 7 (to leaf 0), 13 (to
leaf 12) and 16 (to leaf 17).
When pruning from 2:
‐ Adjacency 2‐16 can be deleted, as node 16 is closer to the root than the closest pinned
components to the current node (in this case the pinned paths via 13 and 7, and leaves 4 and
8).
‐ Adjacencies 2‐7 and 2‐13 are of equal class in that both are pinned nodes due to unique
paths to 0 and 12 respectively. As we currently keep the adjacency to the lowest node
ID when all other things are equal, we would delete 2‐13.
‐ Adjacencies 2‐10 and 2‐8 are of equal class, and superior to adjacencies 2‐7 and 2‐13
(leaves being superior to pinned paths). Therefore adjacencies 2‐7 and 2‐13 would be
pruned.
‐ Adjacency 2‐10 is considered superior to adjacency 2‐8, as nodes 4 and 8 are of equal
class (both leaves), but 4 has the lower node ID.
So if all the nodes and adjacencies in the above diagram were present, according to the pruning
rules, this would be pruned down to…
Figure 12
Note that any adjacency that has a candidate node closer to the pruning node than any pinned
component will NOT be pruned. For example, if the node IDs 4 and 8 were reversed, adjacency
2‐10‐8 would not be deleted despite 2‐4 having a lower node ID, as there is a possibility that
subsequent prunes may change 10’s role, making 2‐10 a superior adjacency to 2‐4.
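One way to read the precedence rules is as a ranking over a node's upstream adjacencies. The sketch below is an interpretation rather than the draft's definition: the `Upstream` record, its field names and the distances used are all assumptions made for illustration, since the figure's actual metrics are not reproduced here. Each adjacency is ranked by distance to its closest leaf or pinned path, then leaf over pinned path, then lowest node ID, and an adjacency behind a closer candidate replication point takes part in the ranking but is never pruned for ranking poorly.

```python
from dataclasses import dataclass

LEAF, PINNED = 0, 1        # a leaf outranks a pinned path

@dataclass(frozen=True)
class Upstream:
    peer: int              # node at the far end of the adjacency
    anchor: int            # node ID of the closest leaf or pinned path behind it
    kind: int              # LEAF or PINNED
    distance: int          # distance from the pruning node to the anchor
    via_candidate: bool    # a candidate replication point sits before the anchor

def prune_upstream(adjacencies):
    """Return (kept, pruned) lists of adjacencies for one pruning node."""
    best = min(adjacencies, key=lambda a: (a.distance, a.kind, a.anchor))
    kept, pruned = [], []
    for a in adjacencies:
        (kept if a is best or a.via_candidate else pruned).append(a)
    return kept, pruned

# Pruning from node 2 in the Figure 10 example; distances are hypothetical.
from_2 = [
    Upstream(16, 16, PINNED, 3, False),
    Upstream(7, 7, PINNED, 2, False),
    Upstream(13, 13, PINNED, 2, False),
    Upstream(10, 4, LEAF, 2, True),     # leaf 4 lies behind candidate node 10
    Upstream(8, 8, LEAF, 2, False),
]
kept, pruned = prune_upstream(from_2)
```

With these inputs only the adjacency to 10 is kept. If the IDs of leaves 4 and 8 were swapped, the directly attached leaf would rank best, yet the adjacency to 10 would still be kept because a candidate replication point sits in front of its anchor.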
Applying this to our network example, nodes 13 and 16 both have choices to make. 13 has equal
choices because both nodes 2 and 5 are on pinned paths: 2 is pinned by traffic to node 14, and
5 is pinned by traffic to node 17. 16 has unequal choices, either via 4 (leaf) or via 2 (pinned
path), where the path via 4 will be preferred.
Figure 13
The further simplification of eliminating node 5 as being purely transit then provides the
resulting MDT, in which all leaves have a unique path to the root.
“Unsafe” pruning
If repeated application of “safe” operations has not completely resolved the MDT it will be
because there are prunes that require the coordination of the decisions of multiple nodes. The
strategy to deal with this is to make a single prune based upon a heuristic that will have a high
probability of multiple nodes making a common decision, and then revert to “safe” operations
to resolve the tree. This becomes a “rinse and repeat” operation. The premise is that if the
“guess” was correct, all subsequent “safe” operations will be consistent with it and we will end
up with an ECMP friendly tree that requires no corrections.
Prior to making the first “unsafe” prune, we need to identify the leaves that have already been
fully resolved. They will require no further consideration. All leaves that are resolved either by
the first “unsafe” prune or subsequent prunes and simplifications will need to be checked for
ECMP friendliness.
The “unsafe” prune utilized is to select the node closest to the root that has multiple
upstream adjacencies. The heuristic for selecting which upstream path to keep is to select
the path with the densest count of potentially served leaves (known as the PSL count). As the
immediate adjacencies will all have an equal PSL count, this involves walking backwards
towards the root until the closest point with the highest PSL count is found.
Figure 14
Figure 14 is an example of an MDT that could not be resolved purely by “safe” operations. Both
nodes 10 and 8 have upstream links where a “potential replication point” is closer than any
pinned component. The diagram is drawn to illustrate the relative metrics so, for example,
node 0 is closer to the root and further from node 10 than node 11 is.
The closest unresolved node is 10. Walking back both upstream adjacencies, the next
adjacencies (0 to root, and 11 to 7) both have PSL counts of 2. But node 11 being closer to node
10 will inform the decision to prune the link 0‐10, resulting in:
Figure 15
Similar to “safe” pruning, any consequent simplifications of the graph need to be performed. In
this particular example there were not any.
At this point the application of “safe” operations resumes. The view from 8 is that there
is a pinned path at 11, therefore all adjacencies with components closer to the root than 11 can
be pruned, eliminating the link from 8 to 6. At this point all leaves have a unique path to the
root.
Figure 16
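The guess heuristic described in this section can be sketched in two small functions. This is an interpretation under stated assumptions: the inputs are hypothetical precomputed structures, the distances are invented because the figure's metrics are not reproduced here, and breaking remaining ties by lowest node ID is an assumption borrowed from the ordering used for “safe” pruning.

```python
def pick_guess_node(dist, upstream):
    """Node closest to the root that still has multiple upstream adjacencies.

    dist maps node -> distance from the root; upstream maps node -> set of
    upstream peers. Ties fall to the lowest node ID (an assumption).
    """
    candidates = [n for n, ups in upstream.items() if len(ups) > 1]
    return min(candidates, key=lambda n: (dist[n], n))

def pick_upstream_to_keep(walks):
    """Choose the upstream peer whose path reaches the densest PSL count soonest.

    walks maps an upstream peer to a list of (distance from the guessing
    node, PSL count of the link found there), gathered while walking that
    path back towards the root.
    """
    def score(peer):
        top = max(count for _, count in walks[peer])
        nearest = min(d for d, count in walks[peer] if count == top)
        return (-top, nearest, peer)   # densest, then closest, then lowest ID
    return min(walks, key=score)

# Figure 14 style example: node 10 has upstream peers 0 and 11, and both
# walks reach a PSL count of 2, but via 11 it is reached closer to node 10.
node = pick_guess_node({10: 2, 8: 3, 11: 1}, {10: {0, 11}, 8: {11, 6}, 11: {7}})
keep = pick_upstream_to_keep({11: [(0, 1), (1, 2)], 0: [(0, 1), (3, 2)]})
```

Under these assumed metrics the sketch selects node 10 and keeps the adjacency to 11, which corresponds to pruning link 0‐10 as in the example.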
Tree Audit
In the previous example, leaves 8 and 10 were not resolved by the initial set of “safe”
operations, therefore they and all intermediate nodes they transit need to be checked for
“ECMP friendliness”. In that example, there is not a problem as on the shortest path, node 11 is
closer to 8 and 10 than any other nodes in the tree.
We could also consider the situation illustrated below, whereby for whatever reason we ended
up with this as the MDT, and this included resolving node 12 after an “unsafe” prune was
performed. We will postulate for this example that leaves 0, 2 and 9 were resolved by “safe”
rules, and therefore do not need checking.
Figure 17
In this case, there is an MDT component (node 9) that is both closer to 12 than its first
upstream component (node 8) and downstream of node 8. The issue with this arrangement
is that a tunnel to 12 could be via node 9, implying that two copies of a packet from the root
would transit 8‐9.
Therefore the tree needs to be modified. 12 is connected to the closer node (node 9), and the
adjacency to 8 is eliminated. This also results in the further simplification that 8, which is no
longer a replication node, can be replaced with a link. The corrected tree is:
Figure 18
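The audit can be sketched as a check for other MDT nodes lying on a shortest path between a node and its upstream MDT component. This is a minimal illustration under stated assumptions: equal link weights, a hypothetical topology loosely modelled on the Figure 17 example (node 5 and root ID 0 are invented), and function names chosen here for illustration.

```python
from collections import deque

def hops(adj, src):
    """Hop counts from src; valid because all link weights are equal."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def audit(adj, parent, suspects):
    """Return {node: closer MDT node} for every suspect that fails the check.

    adj is the network adjacency map, parent maps each MDT node to its
    upstream MDT node, and suspects are the nodes resolved by or after a guess.
    """
    mdt_nodes = set(parent) | set(parent.values())
    failures = {}
    for node in suspects:
        up = parent[node]
        from_up, from_node = hops(adj, up), hops(adj, node)
        on_path = [n for n in mdt_nodes - {up, node}
                   if from_up[n] + from_node[n] == from_up[node]]
        if on_path:
            # Reattach to the MDT node nearest the suspect on that path.
            failures[node] = min(on_path, key=lambda n: (from_node[n], n))
    return failures

# Root 0 feeds 8; 8 replicates to leaf 9 and, by tunnel, to leaf 12.
adj = {0: [8], 8: [0, 9, 5], 9: [8, 12], 5: [8, 12], 12: [9, 5]}
parent = {8: 0, 9: 8, 12: 8}
problems = audit(adj, parent, [12])
```

Here `problems` maps 12 to 9: leaf 9 lies on a shortest path from 8 to 12, so 12 should be reattached to 9, after which 8 is no longer a replication point.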
“Early exit” when computing
The big optimization in any implementation is ultimately going to be how little the code has to
compute. This means that when the computing node knows what its role is with respect to a given
MDT, which includes having no role, it can cease processing. For example, a node can
terminate processing of a given (S,G) tree as soon as it is removed from the current
intermediate product of determining that tree. It will not have a role, nor will it need to
install state, so the node is free to move on to the next G. Similarly, if all leaves that are
downstream of the computing node have been resolved by “safe” operations, the node is done, as
it has sufficient knowledge to determine what state it needs to install. It does not need to
fully resolve the entire tree; it can tell when what it needs to know has been determined.
Software Implementation
I won’t say a lot about implementation, except that the above already includes some specific
requirements that also happen to be a degree of optimization. Key thing IMO is that the
description should scream “bitmaps” at you. Keeping track of potential served leaves, pinned
paths and fully resolved leaves is most easily done with bitmap operations. So I do tend to think
of this as “BIER in the control plane”.
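As a small illustration of the bitmap idea, the sketch below packs leaf sets into integers, assuming one bit per leaf in a fixed order. The helper names are invented here. The leaf IDs, the PSL of link 12‐8 and the set of resolved leaves are taken from the worked example earlier in this document, and are combined purely to show the operations.

```python
LEAVES = [4, 13, 14, 15, 16, 17]                   # leaves of the example MDT
BIT = {leaf: 1 << i for i, leaf in enumerate(LEAVES)}

def bitmap(leaves):
    """Pack a collection of leaf IDs into an integer bitmap."""
    bits = 0
    for leaf in leaves:
        bits |= BIT[leaf]
    return bits

def members(bits):
    """Unpack a bitmap into a list of leaf IDs in LEAVES order."""
    return [leaf for leaf in LEAVES if bits & BIT[leaf]]

psl_12_8 = bitmap([4, 13, 14, 16])                 # PSL given for link 12-8
resolved = bitmap([4, 14, 15, 17])                 # leaves resolved so far
unresolved_on_link = psl_12_8 & ~resolved          # set difference in one step
psl_count = bin(psl_12_8).count("1")               # PSL count of the link
```

Comparing two PSL sets for equality, as the transit node test requires, becomes a single integer comparison, and `members(unresolved_on_link)` yields leaves 13 and 16.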
Dataplane Implementation
At a top level, the technique assumes that an MPLS forwarding engine can resolve a single
multicast SID label into multiple next hops, each of which may be represented by multiple
NHLFEs which then need further resolution via ECMP processing. This may be beyond existing
implementations. In this scenario, the control plane can explicitly select the set of NHLFEs with
no ECMP processing for the first hop. This is a nodal behavior that has no effect on the
surrounding nodes in the MDT. The only consequence is that the load spreading of the first hop
of the MDT past a node implementing this option may be diminished.
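The two dataplane behaviors can be contrasted in a short sketch. Everything here is hypothetical: the label value, the next hop names, the NHLFE strings, the use of CRC32 as a stand‐in for an ECMP hash, and the choice of the first NHLFE as the pinned one are assumptions made only to show the shape of the lookup.

```python
import zlib

# Hypothetical entry: multicast SID label -> next hops -> candidate NHLFEs.
FIB = {
    16001: {
        "node-4": ["if1:label-104", "if2:label-104"],
        "node-13": ["if3:label-113"],
    }
}

def replicate(label, flow_key, fib=FIB):
    """One copy per next hop, with ECMP choosing among that hop's NHLFEs."""
    copies = []
    for hop, nhlfes in sorted(fib[label].items()):
        index = zlib.crc32(flow_key) % len(nhlfes)   # stand-in for the ECMP hash
        copies.append((hop, nhlfes[index]))
    return copies

def pin_first_hop(fib):
    """Control plane fallback: preselect one NHLFE per next hop."""
    return {label: {hop: nhlfes[:1] for hop, nhlfes in hops.items()}
            for label, hops in fib.items()}

pinned = pin_first_hop(FIB)
```

With the pinned table every flow takes the same NHLFE towards each next hop, which reflects the diminished first hop load spreading noted above while leaving the neighbouring nodes unaffected.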
Terms and Acronyms
ECMP – equal cost multipath
ECMP friendly – an MDT where the closest replication point to a leaf has no closer leaves or
replication points on the shortest path between them
MDT – multicast distribution tree
PSL – Potentially Served Leaf
S,G – an MDT for G rooted at S
S,* – a tree from S to all other nodes in the network