
Reinforcement Learning and CMAC-based Adaptive Routing for MANETs

David Chetret, Chen-Khong Tham
Computer Communication Networks Laboratory
Dept of Electrical and Computer Engineering (ECE), National University of Singapore
3 Engineering Drive 3, Singapore 117576
Email: {g0203504, eletck}@nus.edu.sg

Lawrence W. C. Wong
Institute for Infocomm Research
21 Heng Mui Keng Terrace, Singapore 119613
Email: [email protected]

Abstract-A novel routing scheme, which combines the on-demand routing capability of the Ad Hoc On-Demand Distance Vector (AODV) routing protocol with a Q-routing inspired route selection mechanism, is proposed in this paper. The scheme makes routing choices based on local information (such as mobility and power remaining at the neighbouring nodes) and past experience. The CMAC (Cerebellar Model Articulation Controller) function approximator is used to accelerate the reinforcement learning. Through extensive simulation, we demonstrate that our scheme is effective in improving end-to-end delay, without requiring much of the limited network resources.

I. INTRODUCTION

A Mobile Ad Hoc Network (MANET) [1] is a network of wireless mobile nodes that does not rely on any base stations or fixed infrastructure. Communications are made using multihop wireless links. Therefore each node also acts as a router, forwarding data packets for other nodes. Low-cost mobile devices such as laptops and palmtops are becoming widely available, making large scale ad hoc networks in the civil field a realistic possibility.

Dynamism, absence of costly and often cumbersome infrastructure, decentralization and robustness are the main advantages of MANETs. This makes them particularly adapted to disaster recovery situations, interactive courses or meetings. Future application possibilities are endless, from sensor dust networks to multihop wireless broadband Internet access, wireless imaging, cooperative car traffic monitoring, Personal Area Networks, etc. The difficulties inherent in this type of network come from the limited computing resources and bandwidth, as well as the potentially high degree of node mobility and energy limitations. Therefore, ad hoc routing protocols must be efficient, with a low load on the network, and at the same time be accurate in their decision-making mechanism.

Current routing protocols for MANETs suffer from several shortcomings [2], [3]. Proactive protocols, such as Destination Sequenced Distance Vector (DSDV) [4], consume a large portion of the already scarce network capacity. This is because they exchange extensive routing table information, which represents large chunks of data, thus generating significant routing overhead. On-demand routing protocols like Dynamic Source Routing (DSR) [5] launch route discovery every time a data packet needs to be sent, requiring the actual communication to be delayed until the route is determined. This may not be suitable for real-time data and multimedia communication applications.

Ad Hoc On-Demand Distance Vector (AODV) [6] routing is also a reactive protocol, but it uses controlled flooding for its route discovery. Therefore, it is more scalable than DSDV or DSR. However, it only keeps one route in memory for a given destination, and if a route to the destination is available and up, no alternative route found will be taken into consideration until the current one breaks.

Our objective is to modify AODV to include alternative routes and a route selection process based on local conditions, thus making it more dynamic and reactive to local condition changes and past experience.

II. AODV ROUTING PROTOCOL

AODV is a reactive protocol, using route discovery and route maintenance mechanisms much like DSR. However, nodes maintain routing tables as in DSDV.

When a node does not have a valid, unexpired route to a destination it wishes to send a packet to, it initiates a route discovery to locate the destination node. This is done by broadcasting a Route Request (RREQ) packet to all of its neighbours, which then forward it to their own neighbours, and so on, until either the destination is reached, or an intermediate node with a "fresh enough" route to the destination is found. If there is no other route, the route is added directly to the routing table of the node. Otherwise, if route a is the newly discovered route and route b is the current route used, in order to be added to the routing table route a must verify the following:

$$(seq_a > seq_b) \;\vee\; (seq_a = seq_b \,\wedge\, hop\_count_a < hop\_count_b)$$

where $seq$ is the sequence number and $hop\_count$ is the number of hops from source to destination.
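As a small illustration (a sketch, not code from the paper), the acceptance test above can be written as a simple predicate:

```python
def accept_new_route(seq_a: int, hops_a: int, seq_b: int, hops_b: int) -> bool:
    """Return True if newly discovered route a should replace current route b.

    A route is accepted if it is fresher (higher destination sequence number),
    or equally fresh but shorter in hop count.
    """
    return seq_a > seq_b or (seq_a == seq_b and hops_a < hops_b)
```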

If a node receives multiple RREQ packets with the same origin-destination pair, it discards the additional copies. The node then responds by sending a Route Reply (RREP) across the reverse path to the source. Likewise, if multiple copies of the RREP are received, only the first one to arrive is taken into account.

The use of sequence numbers guarantees that routes are loop free, since the routes used are always updated together. Periodic HELLO beacon packet broadcasts are made to maintain information about local connectivity at each node. If a node along the route moves, its upstream neighbors notice the move and propagate a link failure notification or route error message (RERR) to each of their active upstream neighbors to inform them of the removal of that part of the route.

The main advantage of AODV is that it uses the expanding ring search technique for route discovery. This prevents unnecessary broadcasts of requests, since increasingly larger neighborhoods are searched to find the destination. The search is controlled by the time-to-live (TTL) field in the IP header of the request packets. Furthermore, to increase efficiency, if the route to a previously known destination is needed, the prior hop-wise distance is used to accelerate the search.
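For illustration, a minimal sketch of the expanding ring search described above; the TTL schedule, timeout and the broadcast/lookup primitives are assumptions for this sketch, not values from the paper or the AODV specification:

```python
import time

def expanding_ring_search(dest, broadcast_rreq, route_found,
                          ttl_start=1, ttl_step=2, ttl_max=7, timeout=0.5):
    """Broadcast RREQs with an increasing TTL until a route to dest is found.

    broadcast_rreq(dest, ttl): hypothetical primitive sending a RREQ limited to ttl hops.
    route_found(dest): hypothetical predicate checking whether a RREP has arrived.
    """
    ttl = ttl_start
    while ttl <= ttl_max:
        broadcast_rreq(dest, ttl)   # search an increasingly larger neighbourhood
        time.sleep(timeout)         # wait for a RREP from this ring
        if route_found(dest):
            return True
        ttl += ttl_step             # expand the ring
    broadcast_rreq(dest, ttl_max)   # final, widest attempt
    time.sleep(timeout)
    return route_found(dest)
```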

Route maintenance is conducted in the following manner: each time a route is used to forward a data packet, its route expiry time is incremented. A routing table entry is invalidated if it is not used within this expiry time. AODV also has the following feature: it maintains an active neighbour node list, enabling each node to keep track of the neighbours forwarding packets to it, classified according to the route entry used to route data packets. These nodes will be notified with route error (RERR) packets when the link to the next hop node is broken. In turn, each one of these neighbour nodes will forward the RERR packet to its own list of active neighbours, thus invalidating all the routes using the broken link.

III. PROBLEM DESCRIPTION AND APPROACH

As mentioned earlier, the path chosen by AODV to forward a packet is only dependent on the number of hops and the route sequence numbers. However, parameters such as congestion at neighbour nodes, neighbour connectivity, mobility, quality of signal, distance, and remaining energy at each node will have a greater impact on the chances of success of a communication with respect to particular QoS bounds.

Therefore, it seems important to learn a policy which determines the forwarding action to take depending on the local state of the network and the information received from neighbouring nodes.

We will explain in Section IV the mathematical theory behind the intuitive concepts we introduce here. Reinforcement learning is based on state, action, and reward. We define the parameters as follows:

• State: we consider the best neighbourhood values for buffers, connectivity, mobility, remaining energy, and Signal-to-Noise Ratio (SNR, characterized by the power received at the node from the given neighbour).

• Action: in adaptive path routing the action is the path selection. To make the learning more general, we define the action as a choice between the best nodes relative to each value, e.g.:
  1) action 1: send through the path such that the next hop node has the best value for buffers;
  2) action 2: send through the path such that the next hop node has the best value for connectivity, etc.
This action list is independent of the neighbourhood nodes, i.e. the actual node addresses corresponding to each best value are unknown to the learning algorithm, so the learning is done independently of the current topology. This is a very general formulation of the problem considered.

• Reward: it should reflect how well the QoS is satisfied at the node. We choose the end-to-end delay QoS metric as the reward.
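To make the state/action formulation above concrete, the following is a minimal sketch (not taken from the paper; the field names and the convention that larger values are better are illustrative assumptions) of how a node could assemble its state vector from the neighbour table and map each abstract action to a concrete next hop:

```python
from dataclasses import dataclass

# Per-neighbour information gathered from HELLO packets (illustrative fields).
@dataclass
class NeighbourInfo:
    addr: str
    buffers: float       # free buffer space advertised by the neighbour
    connectivity: float  # number of neighbours the neighbour reports
    mobility: float      # received-power trend (positive = moving closer, so larger is better)
    energy: float        # remaining energy advertised by the neighbour
    snr: float           # received power of the neighbour's HELLO packet

# Abstract actions: "route via the neighbour that is best for metric m".
METRICS = ["buffers", "connectivity", "mobility", "energy", "snr"]

def build_state(neighbours: list[NeighbourInfo]) -> list[float]:
    """State = best neighbourhood value for each metric (topology independent)."""
    return [max(getattr(n, m) for n in neighbours) for m in METRICS]

def action_to_next_hop(action: int, neighbours: list[NeighbourInfo]) -> str:
    """Map an abstract action index (metric choice) to a concrete neighbour address."""
    metric = METRICS[action]
    return max(neighbours, key=lambda n: getattr(n, metric)).addr
```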

IV. MATHEMATICAL BACKGROUND

A. Q-routing

Q-routing was introduced by Boyan and Littman in 1993 [8]. The algorithm can be summarized in the following two steps:

1) Sending of a packet $P(d, s)$, originated by node $s$, forwarded by node $x$, with destination $d$, denoted as $sendPacket_x(P(d, s))$.
2) Reception of a packet $P(d, s)$, originated by node $s$, forwarded by node $x$, with destination $d$, at node $y$, a neighbour of node $x$, denoted as $recvPacket_y(x, P(d, s))$,

where $s$ is the source node of the packet. Let $Q_x(d, y)$ be an estimate of the time for a packet to travel from node $x$ to the final destination $d$ through the neighbouring node $y \in N(x)$, let $\arg\min_y(Q_x(d, y))$ be a function returning the $y$ that minimizes $Q_x(d, y)$, and let $\min(Q)$ be a function returning the smallest element of $Q$.

Sending procedure $sendPacket_x(P(d, s))$:
1) Packet $P(d, s)$ is at node $x$.
2) $x$ calculates the best neighbour through which to reach $d$: $y = \arg\min_{y \in N(x)}(Q_x(d, y))$.
3) $x$ sends $P(d, s)$ to neighbour $y$.
   a) $x$ waits for the estimate $Q_y(d, i)$, where $i \in N(y)$.
   b) $x$ receives the estimate $Q_y(d, i)$.
   c) $x$ updates $Q_x(d, y)$ according to equation (1).
   d) $x$ is ready to send another packet (go to step 1).

Reception procedure $recvPacket_y(x, P(d, s))$:
1) Packet $P(d, s)$ is received by node $y$.
2) $y$ calculates the best time to reach $d$: $Q_y(d, z') = \min_{z \in N(y)} Q_y(d, z)$.
3) $y$ sends the estimate $Q_y(d, z')$ back to node $x$.
4) If the current node $y$ corresponds to the destination $d$ (in the routing table), done (go to step 1). Else $sendPacket_y(P(d, s))$ (go to step 1).
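The send/receive procedures above can be summarized in a short sketch (illustrative only; the Q-table structure, the neighbour set and the measured hop delay are placeholders, and the update is the delay-estimate form of the Q-routing update referenced as equation (1)):

```python
INITIAL_ESTIMATE = 0.0

class QRouter:
    def __init__(self, node_id, neighbours, eta=0.5):
        self.node_id = node_id
        self.neighbours = neighbours   # N(x): current neighbour ids
        self.eta = eta                 # learning rate
        self.q = {}                    # (dest, neighbour) -> estimated delay to dest

    def estimate(self, dest, neighbour):
        return self.q.get((dest, neighbour), INITIAL_ESTIMATE)

    def best_neighbour(self, dest):
        """y = argmin_{y in N(x)} Q_x(d, y): neighbour to forward through."""
        return min(self.neighbours, key=lambda y: self.estimate(dest, y))

    def best_time(self, dest):
        """min_{z in N(x)} Q_x(d, z): estimate returned to the upstream node."""
        return min(self.estimate(dest, z) for z in self.neighbours)

    def update(self, dest, neighbour, hop_delay, reported_estimate):
        """Move Q_x(d, y) towards (delay to y) + (y's best estimate to d)."""
        old = self.estimate(dest, neighbour)
        target = hop_delay + reported_estimate
        self.q[(dest, neighbour)] = old + self.eta * (target - old)
```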

Now let us explain how the Q-values are updated and stored.


B. Reinforcement Learning

Reinforcement learning (RL) [9], [10] is a form of machine learning in which the learning agent has to formulate a policy which determines the appropriate action to take in each state in order to maximize the expected cumulative reward over time. An effective way to achieve reinforcement learning is to use the Q-learning algorithm, in which the values of state-action pairs are maintained and updated over time in the following manner:

$$Q_{t+1}(x, a) =
\begin{cases}
Q_t(x, a) + \eta_t \left[ r_t + \gamma \hat{Q}_t(y_t) - Q_t(x, a) \right], & x = x_t \text{ and } a = a_t \\
Q_t(x, a), & \text{otherwise}
\end{cases} \quad (1)$$

where $y_t$ is the next state when action $a_t$ is taken in state $x_t$, $\hat{Q}_t(y_t) = \max_{l \in A(y_t)} Q_t(y_t, l)$, $A(y_t)$ is the set of actions available in state $y_t$, and $r_t$ is the immediate reinforcement that evaluates the last action and state transition.

The $\gamma$ term discounts the observed Q-value from the next state to give more weight to states which are near in time, since they are more responsible for the observed outcome, while $\eta_t$ is a learning rate parameter that affects the convergence rate of the Q-values in the face of stochastic state transitions and rewards.

The action $a$ in each state $x_t$ is selected according to the Boltzmann probability distribution:

$$P(a \mid x_t) = \frac{e^{Q_t(x_t, a)/T}}{\sum_{l \in A(x_t)} e^{Q_t(x_t, l)/T}}$$

where $A(x_t)$ is the set of available actions at state $x_t$ and $T$ is a temperature parameter which determines the probability of selecting non-greedy actions. This mechanism enables us to perform both exploration and exploitation.
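A minimal sketch of Boltzmann (softmax) action selection over one row of Q-values follows; the default temperature and the tie-handling are illustrative assumptions, not values from the paper:

```python
import math
import random

def boltzmann_select(q_values, temperature=0.5):
    """Pick an action index with probability proportional to exp(Q/T).

    q_values: list of Q_t(x_t, a) for the available actions A(x_t).
    Lower temperatures approach greedy selection; higher ones explore more.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for action, w in enumerate(weights):
        acc += w
        if r <= acc:
            return action
    return len(q_values) - 1  # fallback for floating-point edge cases
```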

C. Cerebellar Model Articulation Controller

A CMAC is a particular type of neural network. In our system, it is used to store the Q-values. Neural networks do not store exact values, but try to reconstruct them when required. The advantage of the neural network approach is that it is a storage unit with constant memory occupancy. Many neural networks perform non-linear transformations of an input vector, but the CMAC has the advantage of representing relatively small values with greater precision than relatively large ones. In the case of Q-routing, this is particularly well adapted, since we are trying to minimize delay and will need to separate smaller values, while large values should not use up too much storage space.

The principle of the CMAC is as follows: each point in space stimulates a set of overlapping and offset receptive fields in each dimension. Receptive functions associate the input to the receptive fields. Combining the excited receptive fields of the corresponding receptive functions yields a hypercube. The overlap of all these hypercubes is a hypercube containing the point of interest. The output value is the sum of contributions from the active receptive fields.

We will now explain how the decimal state values can be stored as integer CMAC values without losing too much precision. Let $\bar{s}$ be the state vector ($N$ dimensions), $K$ the number of quantizing functions per input dimension, and $(R_j)_{j \in \{1,\dots,N\}}$ the number of resolution elements for each dimension. Two mappings are performed by the CMAC:

• A non-linear mapping, from the original $N$-dimensional vector $\bar{s}$ to a binary vector $\bar{x}$ with higher dimensionality $\sum_{j=1}^{N} R_j$. This is the generalization (or quantization) step, as shown in Figure 1. It associates one cell per quantization function to the initial input vector.

Fig. 1. Here $K=3$ and a particular dimension $j$ is represented, with $s_j = 15$. A and B have the same binary vector representation for this dimension (active cells 2, 7, 12), whereas C is represented by active cells 2, 8, 13.

• A linear mapping, where $\bar{x}^T$ is multiplied by the current value $\bar{w}$ of the vector of cell weights (for a given dimension). This is the memory reduction step.

Therefore, combining these two steps, for a given dimension, the mappings within the CMAC are as follows:

• $l$-th quantizing function ($l \in \{1,\dots,K\}$) for dimension $i$:
$$x_i^l = f_l(s_i, l, K) = \mathrm{int}\!\left(\frac{s_i + l}{K}\right)$$

• Linear mapping:
$$y = \sum_{l=1}^{K} w_l\big(f_1(s_1, l, K),\, f_2(s_2, l, K),\, \dots,\, f_N(s_N, l, K)\big), \quad \text{or, in vector terms, } y = w^T x.$$

Another way to see this is to write:
$$y = \sum_{l=1}^{K} w_{\mathrm{index}_l}$$
where the $\mathrm{index}_l$ are the indexes of the active weights, given by:
$$\mathrm{index}_l = f_1(s_1, l, K) + \sum_{i=2}^{N} f_i(s_i, l, K) \prod_{j=1}^{i-1} R_j$$


This can be generalised to the case where the output is a vector, each dimension of which will be determined by a different CMAC with its own quantization functions and number of cells (determining the resolution).

Finally, the training procedure is done by updating the weights according to the least-mean square (LMS) algorithm.
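To make the two mappings and the LMS training step concrete, here is a compact sketch of a tiling-based CMAC approximator. The quantization scheme, hashing into a fixed weight table, and the learning rate are assumptions for illustration; they are not the exact functions used in the paper:

```python
class CMAC:
    def __init__(self, n_dims, n_tilings=3, resolution=16, table_size=4096, alpha=0.1):
        self.n_dims = n_dims
        self.n_tilings = n_tilings         # K quantizing functions per dimension
        self.resolution = resolution       # resolution elements per dimension
        self.alpha = alpha                 # LMS learning rate
        self.weights = [0.0] * table_size  # constant memory occupancy

    def _active_cells(self, state):
        """Non-linear mapping: one active cell per tiling (generalization step).

        State components are assumed normalized to [0, 1).
        """
        cells = []
        for l in range(self.n_tilings):
            index = l  # tilings are offset relative to each other
            for s in state:
                q = int((s * self.resolution + l) // self.n_tilings)  # quantize with offset
                index = index * self.resolution + q                   # combine per-dimension cells
            cells.append(index % len(self.weights))                   # hash into the weight table
        return cells

    def predict(self, state):
        """Linear mapping: output = sum of the weights of the active cells."""
        return sum(self.weights[c] for c in self._active_cells(state))

    def train(self, state, target):
        """LMS update: spread the prediction error over the active cells."""
        cells = self._active_cells(state)
        error = target - self.predict(state)
        for c in cells:
            self.weights[c] += self.alpha * error / len(cells)
```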

V. IMPLEMENTATION AND RESULTS

A. Implementation Details

Fig. 2. (a) HELLO packets are modified to give regular state updates to neighbouring nodes. The action selection is based on these state values. (b) RREP packets are modified to include a field containing the time the RREQ packet took to arrive at the destination. QREP packets are introduced as one-hop acknowledgements containing the time taken to the next hop and the estimated time from the next hop to the destination, obtained by taking the minimum of the Q-values for path selections to the destination.

HELLO packets were modified to carry the values of available buffers, connectivity and remaining energy at the originating node in their headers. We measured the received power of the HELLO packet at the target node; the mobility can be deduced simply from the difference between consecutive received powers (e.g. if this difference is positive, then the neighbour node is moving closer). This ensures the state information is regularly updated.
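For example, the mobility indicator can be derived from consecutive HELLO receptions roughly as follows (a sketch; the packet fields and the state-table layout are assumptions):

```python
last_rx_power = {}  # neighbour address -> received power of the previous HELLO

def on_hello_received(neighbour, rx_power, buffers, connectivity, energy, state_table):
    """Update the per-neighbour state entry from a received HELLO packet."""
    prev = last_rx_power.get(neighbour)
    # Positive difference in received power => the neighbour is moving closer.
    power_trend = 0.0 if prev is None else rx_power - prev
    last_rx_power[neighbour] = rx_power
    state_table[neighbour] = {
        "buffers": buffers,            # advertised free buffer space
        "connectivity": connectivity,  # advertised neighbour count
        "energy": energy,              # advertised remaining energy
        "snr": rx_power,               # proxy for link quality
        "mobility": power_trend,       # > 0 means approaching, < 0 means receding
    }
```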

In order to propagate the rewards, we modified the protocol signaling as follows: a new QREP packet type was introduced in the protocol, corresponding to a local acknowledgement. This packet includes the following information in its header:

• the time the packet acknowledged by this QREP took to get to the next hop;
• the current estimate of the delay from the next hop to the destination (based on the Q-values stored at the next hop).

This estimate is obtained by taking the minimum of the available Q-values over all actions (path selections) at the next hop. The new signaling mechanism is illustrated in Figure 2.
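Putting these pieces together, the reward carried back by a QREP can be turned into an update of the sender's Q-value along the lines of equation (1); a sketch in which the field and method names are illustrative, not the paper's implementation:

```python
def on_qrep_received(router, qrep):
    """Handle a QREP local acknowledgement.

    qrep.dest               : final destination of the acknowledged data packet
    qrep.action             : path-selection action that was taken
    qrep.hop_delay          : time the packet took to reach the next hop
    qrep.remaining_estimate : min over the next hop's Q-values towards dest
    """
    target = qrep.hop_delay + qrep.remaining_estimate   # observed + estimated remaining delay
    old = router.q_value(qrep.dest, qrep.action)
    new = old + router.eta * (target - old)             # update for x = x_t, a = a_t
    router.set_q_value(qrep.dest, qrep.action, new)
```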

B. Simulation Results

We performed discrete event simulations under ns-2 [12] of an ad hoc network of 50 wireless mobile nodes, with specified movement and communication scenarios. We used a high mobility scenario with little pause time in order to test the system under difficult conditions. Our protocol evaluations are based on a rectangular (1500 m x 300 m) flat space for 900 s of simulated time. This is because we want to force the use of longer routes (in terms of number of hops) where possible.

In order to see the progress of the learning throughout the simulation, we made measurements of the QoS metrics chosen (average end-to-end delay, packet delivery ratio, energy efficiency) every 100s.

The results obtained are graphed in Figure 3.

• End-to-end delay: we observed an improvement of more than 20% in the long run, which is expected since the learning rewards are based on delay.
• Packet delivery ratio: due to the more complex protocol signaling we observed a small decrease in packet delivery ratio, from about 35% to 31%. In the context of real-time multimedia packet delivery, this small loss is compensated by the better average end-to-end delay.
• Energy efficiency: in terms of energy used per packet delivered, our proposed RL-based protocol was slightly less efficient than the original.

VI. CONCLUSION

We have introduced an innovative and general approach to MANET routing. The state/action/reward framework can be changed in a way that is transparent to the learner. For example, the reward function can be changed to take throughput into account if it is considered an important QoS parameter. Simulations have been presented to show the effectiveness of the scheme in reducing end-to-end delay without significantly affecting throughput or protocol energy efficiency.

In future work, we plan to reduce the overhead of the protocol signalling and to refine our reward function in order to optimize more QoS parameters.


[Figure 3 shows three plots versus simulation time (100 s to 900 s): end-to-end delay, packet delivery ratio, and energy efficiency, each comparing the original AODV with the proposed scheme.]

Fig. 3. Simulation results: the small drop in packet delivery ratio is compensated by a 20% improvement in end-to-end delay.

REFERENCES

[1] C. E. Perkins, Ad Hoc Networking, Addison Wesley Professional, 2001.
[2] J. Broch, D. A. Maltz, D. B. Johnson, Y.-C. Hu and J. Jetcheva, "A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols", in Proc. of the 4th Annual ACM/IEEE International Conference on Mobile Computing and Networking (MobiCom), 1998.
[3] C. E. Perkins, E. M. Royer, S. R. Das and M. K. Marina, "Performance Comparison of Two On-demand Routing Protocols for Ad Hoc Networks", IEEE Personal Communications Magazine, Special Issue on Mobile Ad Hoc Networks, Vol. 8, No. 1, pp. 16-29, Feb. 2001.
[4] C. E. Perkins and P. Bhagwat, "Highly Dynamic Destination-Sequenced Distance-Vector Routing (DSDV) for Mobile Computers", in Proc. of the SIGCOMM '94 Conference on Communications Architectures, Protocols and Applications, pp. 234-244, Aug. 1994.
[5] D. B. Johnson and D. A. Maltz, "Dynamic Source Routing in Ad Hoc Wireless Networks", in Mobile Computing, T. Imielinski and H. Korth (eds.), chapter 5, pp. 153-181, Kluwer Academic Publishers, 1996.
[6] C. E. Perkins, E. M. Royer and S. R. Das, "Ad Hoc On-Demand Distance Vector (AODV) Routing", in Proc. IEEE Workshop on Mobile Computing Systems and Applications, pp. 90-100, Feb. 1999.
[7] F. Yu, V. Wong and V. Leung, "A New QoS Provisioning Method for Adaptive Multimedia in Cellular Wireless Networks", in Proc. of IEEE Infocom '04, Hong Kong, China, March 2004.
[8] J. A. Boyan and M. L. Littman, "Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach", in Advances in Neural Information Processing Systems, Vol. 6, pp. 671-678, 1993. Available at http://www.cs.duke.edu/~mlittman/topics/routing-page.html.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[10] D. P. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.
[11] C. K. Tham, Modular On-line Function Approximation for Scaling up Reinforcement Learning, PhD thesis, University of Cambridge, 1994.
[12] S. McCanne and S. Floyd, ns-2 - The Network Simulator, available from http://www.isi.edu/nsnam/ns/.
[13] A. S. Tanenbaum, Computer Networks, Fourth Edition, Prentice Hall, 2003.
[14] D. P. Bertsekas, Data Networks, Second Edition, Prentice Hall, 1992.
