CS717
Application of AI- and ML-Techniques to Fault-Tolerant Routing
Arjun Rao
CS 717
November 16 and 18, 2004
Papers Covered
• [1] Loh, Peter K.K., “Artificial Intelligence Search Techniques as Fault-Tolerant Routing Strategies”
• [2] Loh, Shaw., “A Genetic-Based Fault-Tolerant Routing Strategy for Multiprocessor Networks”
Papers Covered (cont.)
• [3] Loh, Schröder, Hsu., “Fault-Tolerant Routing on Complete Josephus Cubes” (not AI-related but interesting nevertheless)
If time permits, also:
• [4] Bradley, Tyrrell., “Immunotronics: Hardware Fault Tolerance Inspired by the Immune System”
The Problem of Routing
• Communication between nodes
– Servers
– Microprocessors
• Desire shortest, most efficient paths
– Multiprocessor network topologies, e.g. hypercubes, Josephus cubes, etc.
• Desire availability of paths
– What to do when links/nodes fail?
– How to remain (close to) optimal?
Intro to Fault-Tolerant Routing
• Current algorithms adaptive but non-minimal
• Misrouting
• Routing strategies tied to specific topologies
– k-ary n-cubes, meshes, etc.: Regular structures and symmetry
– Constrained by fault number and types
• More general strategies vulnerable to deadlock and livelock
“Turn Model” [Glass, Ni]
• Widest application scope
– k-ary n-cubes, nD-meshes, torus geometries, etc.
• “West-First” algorithm (on 2D-mesh)
– Messages prevented from turning “west” again
– Prevents cycles and therefore deadlocks
– Routing along virtual channels in strictly decreasing or increasing order
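A minimal sketch of the west-first rule on a 2D mesh, assuming (x, y) coordinates with x growing eastward and y growing northward. The name `west_first_moves` is illustrative and not taken from Glass and Ni; the sketch covers minimal routing only and ignores faults and virtual channels.

```python
def west_first_moves(cur, dst):
    """Return the productive directions that west-first routing allows."""
    (cx, cy), (dx, dy) = cur, dst
    if dx < cx:
        # All westward hops must be taken first, so west is the only choice.
        return ["W"]
    moves = []
    if dx > cx:
        moves.append("E")
    if dy > cy:
        moves.append("N")
    if dy < cy:
        moves.append("S")
    return moves  # an empty list means the packet has arrived


# Destination lies to the north-east: the router may adapt between E and N.
print(west_first_moves((0, 0), (2, 2)))
```

Because a packet never turns back to the west once it has left that direction, the turns needed to close a cycle are never all available.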
Turn Model (cont.)
• Three examples of routing
• “F” = FAILURE
• Full adaptation w/o deadlock and livelock requires more global info → more overhead
AI Search Techniques
• Arbitrary topology → Search space
• Search space → Search tree(s)
• Adaptive but still non-minimal
• Characteristic recursion impractical on loosely-coupled, distributed network
AI Logical Abstraction
• Abstraction:
– S: Problem space
– O: Set of objectives
– P: Search paths
– $S = (O, P)$, where $o_i \in O$ and $p_j \in P$; each $p_j$ connects tuple $(o_k, o_l)$, $k \neq l$
Abstraction used to model…
Multiprocessor Network w/ Generic Topology
• Network
– N: Nodes
– L: Links between nodes
– $G = (N, L)$, where $n_i \in N$ and $l_j \in L$; each $l_j$ connects tuple $(n_k, n_l)$, $k \neq l$
• Objective → Node
• Search path → Link
Abstract Routing Model
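• Search: $(o_s, o_t): S \times S \to S^*$, where $S = (O, P)$ and $S^* = (O^*, P^*)$
– $o_x, o_y \in O$ and $o_x, o_y \in O^*$ → Successful search
– $o_x, o_y \in O$ and $o_x \in O^*$, $o_y \notin O^*$ → Unsuccessful
• Routing attempt R:
– $R(n_s, n_d): G \times G \to G^*$, where $G = (N, L)$ and $G^* = (N^*, L^*)$
– $n_i, n_j \in N$ and $n_i, n_j \in N^*$ → Complete route
– $n_i, n_j \in N$ and $n_i \in N^*$, $n_j \notin N^*$ → Incomplete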
Routing Analogy
• AI search equivalent to routing attempt
• Successful search → Route between source and destination nodes
• Unsuccessful search → Incomplete route to destination
Caveats of Analogy
• No specific search algorithm → No routing strategy
• No optimality constraints
• Nothing about deadlocks/livelocks
• Nothing about fault tolerance!!
Fault-Tolerant Routing Model
• Model considers two aspects:
– Routing system configuration
• Must be generic enough!
– Message propagation protocols and policies
• Following slides introduce what is needed for AI searches (w/ physical message backtracking)
FT Routing Model (cont.)
• Eager readership of input messages
• Single input buffer to avoid polling
• Multiple output buffers to accommodate different delivery rates
• Router process:
– AI/FT routing strategy implemented here
– Physical message backtracking → Increased message sizes
– Increased message sizes/overhead → Requires communications router at each node
Communications Router (cont.)
• Communication router constitutes router process and connections
• Main components: LCM and CP
• ROM: Stores link management and routing software
• RAM: Stores routing table, link status table, associated link lists
CR Routing Table
• For each node, up to n links
• For each link:
– Connected with status OK and node ID of neighbor
– Not connected with status NC and node ID –1
• Link fault represented by timeout:
– Status reset to NC
• Processor fault represented by timeouts in neighbors
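The routing table described above could be sketched as follows. This is an illustration under assumptions, not the paper's layout: the function names are hypothetical, and keeping the neighbor ID after a timeout is a choice made here, since the slide only says the status is reset to NC.

```python
OK, NC = "OK", "NC"
NO_NEIGHBOR = -1  # node ID stored for a link that is not connected


def make_table(neighbors, n_links):
    """Build one node's table: a [status, neighbor ID] entry per link slot."""
    table = [[OK, nid] for nid in neighbors]
    table += [[NC, NO_NEIGHBOR] for _ in range(n_links - len(neighbors))]
    return table


def mark_timeout(table, link):
    """A timeout on a link resets its status to NC (ID kept: an assumption)."""
    table[link][0] = NC


def live_neighbors(table):
    """Neighbors that are still reachable over links with status OK."""
    return [nid for status, nid in table if status == OK]


table = make_table([4, 7], n_links=4)
mark_timeout(table, 0)          # link 0 to node 4 timed out
print(live_neighbors(table))
```

A processor fault needs no separate entry in this sketch: each neighbor of the failed node records a timeout on its own link to it.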
Message Packets
• Six fields:
– Router Control (4 bits): Type of message, including NORMAL and BACKTRACK
– Destination Node ID (10 bits): Supports network of size up to 1024 nodes
– Pending Nodes (20 bytes): Stack of node IDs that may receive packet but have not yet
– Traversed Nodes (20 bytes): Stack of nodes traversed, with most recent on top
Message Packets (cont.)
– Traversed Nodes Index (10 bits): Index to previous traversed nodes field. Supports simulation of physical message backtracking
– Data Field (n-bit pointer): Points to information content of packet
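The six fields can be mirrored in a small data structure. This is a sketch under assumptions: the numeric encodings of NORMAL and BACKTRACK are invented for illustration, and the stack capacity assumes node IDs are packed at 10 bits each, which the slides do not state.

```python
from dataclasses import dataclass, field

NORMAL, BACKTRACK = 0, 1    # assumed encodings within the 4-bit Router Control
MAX_NODE_ID = 2**10 - 1     # a 10-bit ID addresses up to 1024 nodes
STACK_BYTES = 20
# Assumption: IDs packed at 10 bits each, so a 20-byte stack holds 16 of them.
STACK_CAPACITY = STACK_BYTES * 8 // 10


@dataclass
class Packet:
    destination: int
    router_control: int = NORMAL
    pending: list = field(default_factory=list)    # top of stack = last element
    traversed: list = field(default_factory=list)  # most recent node on top
    traversed_index: int = 0                       # position used when backtracking
    data: object = None                            # stands in for the n-bit pointer

    def push(self, stack, node_id):
        """Push a node ID onto one of the two stack fields, enforcing field sizes."""
        if not 0 <= node_id <= MAX_NODE_ID:
            raise ValueError("node ID does not fit in 10 bits")
        if len(stack) >= STACK_CAPACITY:
            raise OverflowError("stack field is full")
        stack.append(node_id)


packet = Packet(destination=42)
packet.push(packet.traversed, 7)
```

The fixed-size stacks are what keep the per-packet overhead bounded, at the cost of limiting how long a route the header can record.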
(Finally) AI Search Strategies
• Brute Force:
– Depth-First Search
– Random Climbing
• Heuristic:
– Hill Climbing
– Best-First Search
– A*
AI Search Strategies (cont.)
• In presence of network faults:
– Prevent cycles → No deadlocks
– Prevent more than two traversals of nodes/links → No livelocks and necessary for AI searches
• Adaptations of search algorithms
• Problems:
– Recursion? Nope (PMB)
– Overhead? Fixed (Well, mostly…)
Common Beginning
Extract header and disassemble it
IF Destination Node is reached, pass packet to host processor
ELSE
  IF Router Control is BACKTRACK
    IF Pending Nodes top node is directly linked
      Route packet to that node
      Set Router Control to NORMAL
    ELSE
      Backtrack packet to previous node in Traversed Nodes
  Pop current node ID from Pending Nodes
  Push current node ID onto Traversed Nodes
Depth-First Search
• Travel as far as possible
– Do not consider alternative paths just yet
• If fault or dead-end, backtrack to most recent possible path
DFS (cont.)
Following common beginning:
Look for directly linked successor nodes
  IF they are already traversed, ignore
  ELSE IF they are in Pending Nodes, ignore
  ELSE push them onto Pending Nodes
Read top node of Pending Nodes
IF directly linked (no fault), route packet to it
ELSE Set BACKTRACK and route to last traversed node
END
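The steps above can be simulated for a single packet. This is a simplified sketch, not the paper's router: it assumes static, symmetric links given as a dict of neighbor sets (failed links are simply absent), and it keeps an explicit `path` stack for backtracking in place of the packet's Traversed Nodes Index. All names are hypothetical.

```python
def dfs_route(links, source, dest):
    """Simulate one packet; return every hop it makes, or None if undeliverable."""
    pending = []          # stack of candidate nodes, top = last element
    traversed = {source}  # nodes the packet has already visited
    path = [source]       # current route back to the source, used to backtrack
    hops = [source]       # full physical trace, including backtrack hops
    while path[-1] != dest:
        current = path[-1]
        # Push successors that are neither traversed nor already pending.
        for nxt in sorted(links.get(current, ()), reverse=True):
            if nxt not in traversed and nxt not in pending:
                pending.append(nxt)
        if not pending:
            return None                   # nothing left to try
        if pending[-1] in links.get(current, ()):
            nxt = pending.pop()           # NORMAL: advance to the top pending node
            traversed.add(nxt)
            path.append(nxt)
        else:
            path.pop()                    # BACKTRACK: return to the previous node
            if not path:
                return None
            nxt = path[-1]
        hops.append(nxt)
    return hops


links = {0: {1, 2}, 1: {0, 3}, 2: {0, 4}, 3: {1}, 4: {2, 5}, 5: {4}}
print(dfs_route(links, 0, 5))  # [0, 1, 3, 1, 0, 2, 4, 5]
```

The trace shows the cost of physical backtracking: the packet walks into the dead end at node 3 and must retrace two hops before it tries the branch through node 2.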
Random Climbing
Following the common beginning:
…
ELSE
Select a successor node randomly
Push unselected successor nodes onto Pending Nodes
…
Hill Climbing
• Heuristic: Estimated remaining distance
Following common beginning:
…
ELSE
Sort successor nodes according to est. remaining distance
Push sorted nodes onto Pending Nodes
…
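A minimal sketch of the hill-climbing push step. The slides do not say how remaining distance is estimated, so Manhattan distance on mesh coordinates is assumed here, and the function names are hypothetical.

```python
def manhattan(a, b):
    """Assumed estimator: hop distance on a 2D mesh without faults."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def push_sorted_successors(pending, successors, dest, estimate=manhattan):
    """Push successors farthest-first so the nearest one lands on top."""
    ordered = sorted(successors, key=lambda n: estimate(n, dest), reverse=True)
    pending.extend(ordered)
    return pending


pending = push_sorted_successors([], [(0, 0), (2, 2), (3, 2)], dest=(3, 3))
print(pending[-1])  # (3, 2): the successor with the smallest estimate
```

Only the new successors are sorted; entries already in Pending Nodes keep their positions, which is why hill climbing effectively looks at immediate neighbors only.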
Best-First Search
• Resumes partial routes not previously considered
• Looks at immediate neighbors, neighbors of predecessors
– Sorts by est. remaining distance
• Leads to non-minimal routes!
Best-First Search (cont.)
…
ELSE
Push (directly linked successor nodes) onto Pending Nodes
Sort Pending Nodes according to est. remaining distance
…
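In contrast to hill climbing, the step above re-sorts the whole Pending Nodes stack. A sketch, again assuming Manhattan distance as the estimator and using hypothetical names; filtering of already-traversed nodes is left out for brevity.

```python
def manhattan(a, b):
    """Assumed estimator: hop distance on a 2D mesh without faults."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def best_first_update(pending, successors, dest, estimate=manhattan):
    """Push the successors, then sort all of Pending Nodes, nearest on top."""
    pending.extend(successors)
    pending.sort(key=lambda n: estimate(n, dest), reverse=True)
    return pending


# (3, 2) was left pending by an earlier node; the only new successor is worse.
pending = best_first_update([(3, 2)], [(0, 0)], dest=(3, 3))
print(pending[-1])  # (3, 2): an older partial route is resumed
```

Because the older entry wins, the packet must travel back to a node adjacent to it, which is one way the non-minimal routes arise.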
A*
• Two heuristics:
– Estimated remaining distance: h
– Path length traversed: g
• Partial paths sorted by f = g + h
• When no faults, always finds minimal route
A* (cont.)
After current ID processing:
Record path length traversed, g
…
ELSE
Calculate and store f for new successor nodes
Push them onto Pending Nodes sorted by f
…
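The A* bookkeeping can be sketched as follows, assuming every hop costs 1 and Manhattan distance serves as h. Storing (f, node) pairs in Pending Nodes is a representation chosen for this illustration, not something the slides specify.

```python
def manhattan(a, b):
    """Assumed h: hop distance on a 2D mesh without faults."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def a_star_update(pending, successors, g, dest, h=manhattan):
    """Store f = g + h for each new successor; keep the lowest f on top."""
    for node in successors:
        f = (g + 1) + h(node, dest)  # g + 1: one more hop to reach the successor
        pending.append((f, node))
    pending.sort(key=lambda entry: entry[0], reverse=True)
    return pending


# Current node was reached after g = 2 hops; one older entry is still pending.
pending = a_star_update([(6, (0, 3))], [(2, 2), (1, 0)], g=2, dest=(3, 3))
print(pending)  # [(8, (1, 0)), (6, (0, 3)), (5, (2, 2))]
```

Including g penalizes candidates that are close to the destination but were reached by a long detour, which best-first search cannot do.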
Performance Testing
• Simulated 125-node multiprocessor network
• Max 8 links per node (maps to many topologies)
• Faulty links and processors
– Pre-specified or dynamically generated
• Testing:
– Messages between every pair of nodes
– 20 trials at 0%, 5%, 10%, 15%, 20% faulty links
– 125 x 125 x 20 x 6 = 1,875,000 tests (??)
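A quick check of the test count. The product on the slide is arithmetically correct, but only five fault levels are listed, so the factor of 6 is not explained by the bullets. The second figure below is simply what the five listed levels would give, not a number from the paper.

```python
nodes, trials = 125, 20
fault_levels = [0, 5, 10, 15, 20]   # percentages listed on the slide

pairs = nodes * nodes               # ordered pairs, self-pairs included
slide_total = pairs * trials * 6    # the product as written on the slide
listed_total = pairs * trials * len(fault_levels)

print(slide_total)   # 1875000
print(listed_total)  # 1562500
```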
Test Results
• As faults increase, heuristic strategies fare better (esp. > 15%)
• A* best search technique but slow
• Hill climbing and best-first search do not consider nodes traversed
– Hill climbing considers only immediate neighbors
Main Point
Using AI search techniques, we abstract from routing in networks to searching in trees (topology-independent, quantity and type of faults irrelevant)
Next Paper
• [1] Loh, Peter K.K., “Artificial Intelligence Search Techniques as Fault-Tolerant Routing Strategies”
• [2] Loh, Shaw., “A Genetic-Based Fault-Tolerant Routing Strategy for Multiprocessor Networks”
Our Little Problem…
• AI search techniques topology- and fault-type independent…
• …but non-minimal routes utilized
• Follow-up work shows how genetic algorithms (combined with heuristics) can find minimal routes in presence of network faults
Genetic Algorithms: Overview
• Optimization strategy
• Population of potential solutions evolves over series of generations
• Each element of population is chromosome; each unit of chromosome is gene
• Chromosomes undergo crossover and mutation
• Most fit chromosomes selected for next generation, based upon fitness function
Abstract Model
• Same as before (including definitions of S and G)
• Pure abstraction suffers from same caveats as before
• Basic idea: Instead of AI search for adaptive route, optimize over population of routes to find best
Chromosome
• Route → Chromosome
• Node on route → Gene in chromosome
• Length of route → Size of chromosome
– Chromosome size directly reflects routing performance!
• Distance traversed → basis of fitness
Mutation and Crossover
• Mutation: Swap and/or shift
• Normal crossover destroys routes, messes with source and destination; problem w/ different lengths
– Use one-point random crossover
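One way to keep source and destination fixed is sketched below. The details are assumptions, not the operators from the paper: mutation swaps two interior genes, and crossover cuts each parent at its own random point so that routes of different lengths can be combined. Neither operator checks that consecutive nodes are actually linked.

```python
import random


def swap_mutation(route, rng):
    """Swap two interior genes; source and destination stay in place."""
    child = list(route)
    if len(child) >= 4:
        i, j = rng.sample(range(1, len(child) - 1), 2)
        child[i], child[j] = child[j], child[i]
    return child


def one_point_crossover(a, b, rng):
    """Head of `a` up to a random cut, then tail of `b` from its own cut."""
    cut_a = rng.randint(1, len(a) - 1)  # head always contains the source
    cut_b = rng.randint(1, len(b) - 1)  # tail always contains the destination
    return a[:cut_a] + b[cut_b:]


rng = random.Random(0)                  # seeded only to make runs repeatable
parent_a = [0, 1, 2, 3, 9]
parent_b = [0, 4, 5, 9]
child = one_point_crossover(parent_a, parent_b, rng)
print(child[0], child[-1])  # 0 9
```

Both parents are assumed to share the same source and destination, which is what lets the child inherit valid endpoints from either side of the cut.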
Fitness Function
• $F = \frac{D_{max} - D_{route}}{D_{max}} + \varepsilon$
– $D_{max}$: Maximum distance between source and destination
– $D_{route}$: Distance traveled by specific route
– $\varepsilon$: Predefined value to ensure non-zero fitness
• Higher value → More fit
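The fitness function translates directly into code. The additive constant is called `epsilon` here as a placeholder name, and its default of 0.01 is an arbitrary illustrative value, not one taken from the paper.

```python
def fitness(d_route, d_max, epsilon=0.01):
    """Shorter routes score higher; epsilon keeps the worst route above zero."""
    return (d_max - d_route) / d_max + epsilon


# A route that uses the maximum distance still gets a small non-zero fitness.
worst = fitness(d_route=10, d_max=10)
better = fitness(d_route=4, d_max=10)
print(worst < better)  # True
```

Without the constant, a route of length Dmax would have fitness zero and could never be picked by a fitness-proportional scheme such as the roulette wheel.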
Selection Scheme
• Roulette Wheel
– Sum of fitness values * random value from [0,1]
– Select chromosomes with fitness greater than product
• Tournament Selection
– Most fit chromosomes selected
• Stochastic Remainder
– Probabilities used to select route
• Which scheme has best performance selecting optimal route?
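Two of the schemes can be sketched briefly. The roulette wheel below is the standard cumulative form (spin = total fitness times a random value, then walk the running sum), which is one reading of the slide's wording and not necessarily the paper's exact procedure. The tournament size `k` is an assumed parameter.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick a chromosome with probability proportional to its fitness."""
    spin = sum(fitnesses) * rng.random()
    running = 0.0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if running > spin:
            return chromosome
    return population[-1]  # guard against floating-point round-off


def tournament_select(population, fitnesses, rng, k=2):
    """Draw k distinct contenders at random and keep the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitnesses[i])]


rng = random.Random(1)  # seeded only to make runs repeatable
routes = [[0, 1, 9], [0, 2, 3, 9], [0, 4, 5, 6, 9]]
scores = [0.8, 0.5, 0.2]
print(tournament_select(routes, scores, rng, k=3))  # [0, 1, 9]
```

With `k` equal to the population size the tournament is deterministic and always returns the fittest route; smaller values of `k` leave room for weaker routes to survive.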