CS717

Application of AI- and ML-Techniques to Fault-Tolerant Routing

Arjun Rao

CS 717

November 16 and 18, 2004

Papers Covered

• [1] Loh, Peter K.K., “Artificial Intelligence Search Techniques as Fault-Tolerant Routing Strategies”

• [2] Loh, Shaw, “A Genetic-Based Fault-Tolerant Routing Strategy for Multiprocessor Networks”

Papers Covered (cont.)

• [3] Loh, Schröder, Hsu, “Fault-Tolerant Routing on Complete Josephus Cubes” (not AI-related but interesting nevertheless)

If time permits, also:

• [4] Bradley, Tyrrell, “Immunotronics: Hardware Fault Tolerance Inspired by the Immune System”

The Problem of Routing

• Communication between nodes
  – Servers
  – Microprocessors

• Desire shortest, most efficient paths
  – Multiprocessor network topologies, e.g. hypercubes, Josephus cubes, etc.

• Desire availability of paths
  – What to do when links/nodes fail?
  – How to remain (close to) optimal?

Intro to Fault-Tolerant Routing

• Current algorithms adaptive but non-minimal

• Misrouting

• Routing strategies tied to specific topologies
  – k-ary n-cubes, meshes, etc.: Regular structures and symmetry
  – Constrained by fault number and types

• More general strategies vulnerable to deadlock and livelock

“Turn Model” [Glass, Ni]

• Widest application scope
  – k-ary n-cubes, n-D meshes, torus geometries, etc.

• “West-First” algorithm (on 2D mesh)
  – Messages prevented from turning “west” again
  – Prevents cycles → no deadlocks
  – Routing along virtual channels in strictly decreasing or increasing order
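
A minimal sketch of the west-first rule on a 2D mesh, assuming (x, y) coordinates with x growing eastward and y growing northward, minimal routing only, and no faults. The function name and the direction labels are illustrative and are not taken from Glass and Ni.

```python
def west_first_next_hops(cur, dst):
    """Directions a minimal west-first router may take from cur toward dst.

    Assumption: nodes are (x, y) tuples; "W" decreases x, "N" increases y.
    """
    (x, y), (dx, dy) = cur, dst
    if (x, y) == (dx, dy):
        return []                 # already at the destination
    if dx < x:
        return ["W"]              # all westward hops must be taken first
    hops = []                     # no westward travel left: adapt freely
    if dx > x:
        hops.append("E")
    if dy > y:
        hops.append("N")
    if dy < y:
        hops.append("S")
    return hops
```

Once the packet no longer needs to travel west, any of the returned directions may be chosen adaptively; the rule never returns "W" after another direction has become eligible.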

Turn Model and Channel Numbering

Turn Model (cont.)

• Three examples of routing

• “F” = FAILURE

• Full adaptation w/o deadlock and livelock requires more global info → more overhead

AI Search Techniques

• Arbitrary topology → Search space

• Search space → Search tree(s)

• Adaptive but still non-minimal

• Characteristic recursion impractical on loosely-coupled, distributed network

AI Logical Abstraction

• Abstraction:
  – S: Problem space
  – O: Set of objectives
  – P: Search paths
  – S = (O, P), where o_i ∈ O and p_j ∈ P; each p_j connects a tuple (o_k, o_l), k ≠ l

Abstraction used to model…

Multiprocessor Network w/ Generic Topology

• Network
  – N: Nodes
  – L: Links between nodes
  – G = (N, L), where n_i ∈ N and l_j ∈ L; each l_j connects a tuple (n_k, n_l), k ≠ l

• Objective → Node

• Search path → Link
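
A small sketch of the network model G = (N, L) as adjacency sets, assuming bidirectional links and hashable node IDs. The helper name `build_network` is hypothetical.

```python
def build_network(nodes, links):
    """Return {node: set(neighbours)} for G = (N, L).

    Assumption: every link is bidirectional and joins two distinct nodes.
    """
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        if a == b:
            raise ValueError("a link must connect two distinct nodes (k != l)")
        if a not in adjacency or b not in adjacency:
            raise ValueError("link endpoint is not in N")
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency
```

Under the analogy on this slide, each key is an objective and each neighbour entry is a one-step search path.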

Abstract Routing Model

• Search (o_s, o_t): S × S → S*, where S = (O, P) and S* = (O*, P*)
  – o_x, o_y ∈ O and o_x, o_y ∈ O* → Successful search
  – o_x, o_y ∈ O and o_x ∈ O*, o_y ∉ O* → Unsuccessful

• Routing attempt R:
  – R(n_s, n_d): G × G → G*, where G = (N, L) and G* = (N*, L*)
  – n_i, n_j ∈ N and n_i, n_j ∈ N* → Complete route
  – n_i, n_j ∈ N and n_i ∈ N*, n_j ∉ N* → Incomplete

Routing Analogy

• AI search equivalent to routing attempt

• Successful search → Route between source and destination nodes

• Unsuccessful search → Incomplete route to destination

Caveats of Analogy

• No specific search algorithm → No routing strategy

• No optimality constraints

• Nothing about deadlocks/livelocks

• Nothing about fault tolerance!!

Fault-Tolerant Routing Model

• Model considers two aspects:
  – Routing system configuration
    • Must be generic enough!
  – Message propagation protocols and policies

• Following slides introduce what is needed for AI searches (w/ physical message backtracking)

FT Routing Model (cont.)

FT Routing Model (cont.)

• Eager readership of input messages

• Single input buffer to avoid polling

• Multiple output buffers to accommodate different delivery rates

• Router process:
  – AI/FT routing strategy implemented here
  – Physical message backtracking → Increased message sizes
  – Increased message sizes/overhead → Requires communications router at each node

Communications Router

Communications Router (cont.)

• Communication router comprises router process and connections

• Main components: LCM and CP

• ROM: Stores link management and routing software

• RAM: Stores routing table, link status table, associated link lists

CR Data Structure: Routing Table

CR Routing Table

• For each node, up to n links

• For each link:
  – Connected with status OK and node ID of neighbor
  – Not connected with status NC and node ID -1

• Link fault represented by timeout:
  – Status reset to NC

• Processor fault represented by timeouts in neighbors
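
A rough sketch of the per-node routing table described above, assuming links are addressed by a port index and that a timeout also resets the neighbor ID to -1. The class and method names are illustrative, not from the paper.

```python
OK, NC = "OK", "NC"   # link status values named on the slide


class RoutingTable:
    """Per-node table with up to n links (port numbering is an assumption)."""

    def __init__(self, max_links):
        # Every port starts out not connected: status NC, node ID -1.
        self.links = [(NC, -1) for _ in range(max_links)]

    def connect(self, port, neighbor_id):
        self.links[port] = (OK, neighbor_id)

    def timeout(self, port):
        # A link fault shows up as a timeout; the status is reset to NC.
        self.links[port] = (NC, -1)

    def neighbors(self):
        """Node IDs reachable over links whose status is OK."""
        return [node for status, node in self.links if status == OK]
```

A processor fault needs no extra state in this sketch: each neighbor of the failed node would simply see a timeout on the link facing it.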

CR Data Structures: Link Status Table, Lists

Message Packets

• Six fields:
  – Router Control (4 bits): Type of message, including NORMAL and BACKTRACK
  – Destination Node ID (10 bits): Supports network of size up to 1024 nodes
  – Pending Nodes (20 bytes): Stack of node IDs that may receive packet but have not yet
  – Traversed Nodes (20 bytes): Stack of nodes traversed, with most recent on top

Message Packets (cont.)

– Traversed Nodes Index (10 bits): Index to previous traversed nodes field. Supports simulation of physical message backtracking

– Data Field (n-bit pointer): Points to information content of packet
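
The six fields can be sketched as a Python dataclass. This is only an in-memory illustration: it does not reproduce the bit-level layout, the field and enum names are assumptions, and only the 10-bit destination range (0 to 1023) is checked.

```python
from dataclasses import dataclass, field
from enum import Enum


class RouterControl(Enum):
    NORMAL = 0
    BACKTRACK = 1


@dataclass
class Packet:
    destination: int                                  # 10-bit node ID
    control: RouterControl = RouterControl.NORMAL     # Router Control
    pending: list = field(default_factory=list)       # stack, top is last item
    traversed: list = field(default_factory=list)     # stack, most recent last
    traversed_index: int = -1                         # index into traversed
    data: object = None                               # stands in for the pointer

    def __post_init__(self):
        if not 0 <= self.destination < 2 ** 10:
            raise ValueError("destination must fit in 10 bits")
```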

(Finally) AI Search Strategies

• Brute Force:
  – Depth-First Search
  – Random Climbing

• Heuristic:
  – Hill Climbing
  – Best-First Search
  – A*

AI Search Strategies (cont.)

• In presence of network faults:
  – Prevent cycles → No deadlocks
  – Prevent more than two traversals of nodes/links → No livelocks and necessary for AI searches

• Adaptations of search algorithms

• Problems:
  – Recursion? Nope (PMB: physical message backtracking)
  – Overhead? Fixed (Well, mostly…)

Common Beginning

Extract header and disassemble it
IF Destination Node is reached, pass packet to host processor
ELSE
    IF Router Control is BACKTRACK
        IF Pending Nodes top node is directly linked
            Route packet to that node
            Set Router Control to NORMAL
        ELSE
            Backtrack packet to previous node in Traversed Nodes
    Pop current node ID from Pending Nodes
    Push current node ID onto Traversed Nodes

Depth-First Search

• Travel as far as possible
  – Do not consider alternative paths just yet

• If fault or dead-end, backtrack to most recent possible path

DFS (cont.)

Following common beginning:

Look for directly linked successor nodes
    IF they are already traversed, ignore
    ELSE IF they are in Pending Nodes, ignore
    ELSE push them onto Pending Nodes
Read top node of Pending Nodes
IF directly linked (no fault), route packet to it
ELSE Set BACKTRACK and route to last traversed node
END
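
The DFS strategy with physical backtracking can be simulated in a few lines. This is a sketch under stated assumptions, not the paper's router: faults are static, `chain` (the current route) stands in for the Traversed Nodes stack and its index, and a node is popped from the pending stack when the packet leaves for it rather than on arrival. The function and variable names are hypothetical.

```python
def dfs_route(adj, faulty, src, dst):
    """Return every physical hop from src to dst, or None if unreachable.

    adj: {node: [neighbours]}; faulty: set of frozenset({a, b}) dead links.
    """
    def linked(a, b):
        return b in adj[a] and frozenset((a, b)) not in faulty

    pending, visited, chain, hops = [], set(), [], []
    cur, backtrack = src, False
    while True:
        hops.append(cur)
        if cur == dst:
            return hops                      # pass packet to host processor
        if not backtrack:
            # NORMAL arrival: record the node, push unseen live successors.
            visited.add(cur)
            chain.append(cur)
            for nxt in adj[cur]:
                if linked(cur, nxt) and nxt not in visited and nxt not in pending:
                    pending.append(nxt)
        if pending and linked(cur, pending[-1]):
            cur, backtrack = pending.pop(), False   # route to top of stack
        else:
            chain.pop()                      # dead end: give up on cur
            if not pending or not chain:
                return None                  # nothing left to try
            cur, backtrack = chain[-1], True # physically hop back one node
```

On a four-node ring 0-1-3-2-0 with the 2-3 link faulty, the packet first tries node 2, backtracks through node 0, and then reaches node 3 via node 1, which illustrates why routes are adaptive but non-minimal.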

DFS Example

DFS Example (cont.)

Random Climbing

Following the common beginning:

ELSE

Select a successor node randomly

Push unselected successor nodes onto Pending Nodes
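
A sketch of the random-climbing step, assuming the successors passed in have already been filtered for faults and previous traversal. The function name is illustrative, and the random source is passed in so runs can be reproduced.

```python
import random


def random_climb(pending, successors, rng):
    """Pick one successor at random; push the others onto the pending stack."""
    if not successors:
        return None                          # caller would backtrack here
    chosen = rng.choice(successors)
    pending.extend(s for s in successors if s != chosen)
    return chosen
```

For example, `random_climb(pending, [1, 2, 3], random.Random(0))` returns the node to route to next and leaves the other two on the stack as fallbacks.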

Hill Climbing

• Heuristic: Estimated remaining distance

Following common beginning:

ELSE

Sort successor nodes according to est. remaining distance

Push sorted nodes onto Pending Nodes
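
The hill-climbing step might look like the sketch below, assuming the top of the stack is the end of the list and that `est_distance` is a caller-supplied estimate of remaining distance (both are assumptions of this illustration).

```python
def hill_climb_push(pending, successors, est_distance):
    """Push new successors so the one estimated closest ends up on top.

    Only the new successors are ordered; older entries keep their place.
    """
    ordered = sorted(successors, key=est_distance, reverse=True)
    pending.extend(ordered)
```

Because entries already on the stack are never reordered, the choice at each hop depends only on the immediate neighbors.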

Best-First Search

• Resumes partial routes not previously considered

• Looks at immediate neighbors, neighbors of predecessors
  – Sorts by est. remaining distance

• Leads to non-minimal routes!

BFS (cont.)

ELSE

Push (directly linked successor nodes) onto Pending Nodes

Sort Pending Nodes according to est. remaining distance
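
A sketch of the best-first step, again assuming the top of the stack is the end of the list and `est_distance` is supplied by the caller. Unlike hill climbing, the whole stack is re-sorted, so a neighbor of an earlier node can overtake the newest successors.

```python
def best_first_push(pending, successors, est_distance):
    """Push successors, then re-sort the entire pending stack.

    After the call, pending[-1] is the node with the smallest estimate.
    """
    pending.extend(successors)
    pending.sort(key=est_distance, reverse=True)
```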

A*

• Two heuristics:
  – Estimated remaining distance: h
  – Path length traversed: g

• Partial paths sorted by f = g + h

• When no faults, always finds minimal route

A* (cont.)

After current ID processing:

Record path length traversed, g

ELSE

Calculate and store f for new successor nodes

Push them onto Pending Nodes sorted by f
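
A sketch of the A* ordering step with f = g + h, assuming a uniform cost per link and storing f next to each pending node. The tuple layout and parameter names are assumptions made for this illustration.

```python
def a_star_push(pending, successors, g_cur, link_cost, h):
    """Store f = g + h for each new successor and keep the stack sorted by f.

    pending holds (f, node) pairs; pending[-1] has the lowest f.
    """
    for node in successors:
        f = (g_cur + link_cost) + h(node)    # g after one more hop, plus h
        pending.append((f, node))
    pending.sort(key=lambda entry: entry[0], reverse=True)
```

With `g_cur=0`, `link_cost=1` and estimates h(1)=3, h(2)=1, node 2 gets f=2 and lands on top of the stack ahead of node 1 with f=4.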

Performance Testing

• Simulated 125-node multiprocessor network

• Max 8 links per node (maps to many topologies)

• Faulty links and processors
  – Pre-specified or dynamically generated

• Testing:
  – Messages between every pair of nodes
  – 20 trials at 0%, 5%, 10%, 15%, 20% faulty links
  – 125 x 125 x 20 x 6 = 1,875,000 tests (??)

Test Results

• As faults increase, heuristic strategies fare better (esp. > 15%)

• A* best search technique but slow

• Hill climbing and best-first search (BFS) do not consider nodes traversed
  – Hill climbing considers only immediate neighbors

Test Results (cont.)

Main Point

Using AI search techniques, we abstract from routing in networks to searching in trees (topology-independent, quantity and type of faults irrelevant)

Next Paper

• [1] Loh, Peter K.K., “Artificial Intelligence Search Techniques as Fault-Tolerant Routing Strategies”

• [2] Loh, Shaw, “A Genetic-Based Fault-Tolerant Routing Strategy for Multiprocessor Networks”

Our Little Problem…

• AI search techniques topology- and fault-type independent…

• …but non-minimal routes utilized

• Follow-up work shows how genetic algorithms (combined with heuristics) can find minimal routes in presence of network faults

Genetic Algorithms: Overview

• Optimization strategy

• Population of potential solutions evolves over series of generations

• Each element of population is chromosome; each unit of chromosome is gene

• Chromosomes undergo crossover and mutation

• Most fit chromosomes selected for next generation, based upon fitness function

Abstract Model

• Same as before (including definitions of S and G)

• Pure abstraction suffers from same caveats as before

• Basic idea: Instead of AI search for adaptive route, optimize over population of routes to find best

Message Packets

• Simplified version:

Chromosome

• Route → Chromosome

• Node on route → Gene in chromosome

• Length of route → Size of chromosome
  – Chromosome size directly reflects routing performance!

• Distance traversed → basis of fitness

Population Creation

Mutation and Crossover

• Mutation: Swap and/or shift

• Normal crossover destroys routes, messes with source and destination; problem w/ different lengths
  – Use one-point random crossover
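
One possible reading of these operators, as a sketch: each parent gets its own random cut point so routes of different lengths can be combined, and the source and destination genes are never moved. The paper's exact operators may differ, the function names are hypothetical, and a child is not guaranteed to follow existing links, so a real router would still have to validate or repair it.

```python
import random


def one_point_crossover(a, b, rng):
    """Swap tails of two routes that share source and destination."""
    if len(a) < 2 or len(b) < 2 or a[0] != b[0] or a[-1] != b[-1]:
        raise ValueError("routes must share source and destination")
    i = rng.randint(1, len(a) - 1)           # head a[:i] keeps the source
    j = rng.randint(1, len(b) - 1)           # tail b[j:] keeps the destination
    return a[:i] + b[j:], b[:j] + a[i:]


def swap_mutation(route, rng):
    """Swap two interior genes; source and destination stay in place."""
    mutated = list(route)
    if len(mutated) >= 4:
        i, j = rng.sample(range(1, len(mutated) - 1), 2)
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```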

Fitness Function

• F = (D_max – D_route) / D_max + ε
  – D_max: Maximum distance between source and destination
  – D_route: Distance traveled by specific route
  – ε: Predefined value to ensure non-zero fitness

• Higher value → More fit
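
The fitness formula translates directly into code. The offset value 0.01 is an arbitrary choice for this sketch; the slides do not give the predefined value.

```python
def fitness(d_route, d_max, offset=0.01):
    """F = (D_max - D_route) / D_max + offset.

    offset is the predefined term that keeps F non-zero when
    D_route equals D_max; 0.01 is an assumed value.
    """
    if d_max <= 0:
        raise ValueError("d_max must be positive")
    return (d_max - d_route) / d_max + offset
```

A route of length 4 with D_max = 8 scores 0.5 plus the offset, while a route as long as D_max scores only the offset, so shorter routes are always fitter.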

Selection Scheme

• Roulette Wheel
  – Sum of fitness values * random value from [0,1]
  – Select chromosomes with fitness greater than product

• Tournament Selection
  – Most fit chromosomes selected

• Stochastic Remainder
  – Probabilities used to select route

• Which scheme has best performance selecting optimal route?
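
For reference, here is a sketch of the first two schemes in their common textbook form, which may differ in detail from the variants evaluated in the paper. Roulette wheel is implemented as a cumulative-sum walk, and tournament size `k` is an assumed parameter.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    threshold = rng.random() * sum(fitnesses)
    running = 0.0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if running > threshold:
            return chromosome
    return population[-1]                    # only reached if all fitness is 0


def tournament_select(population, fitnesses, rng, k=2):
    """Draw k distinct contenders at random and keep the fittest."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]
```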

Reroute

Genetic Hybrid Algorithm