
Application of Reinforcement Learning in Network Routing

By

Chaopin Zhu

Machine Learning

Supervised Learning
Unsupervised Learning
Reinforcement Learning

Supervised Learning

Feature: Learning with a teacher

Phases
• Training phase
• Testing phase

Application
• Pattern recognition
• Function approximation

Unsupervised Learning

Feature
• Learning without a teacher

Application
• Feature extraction
• Other preprocessing

Reinforcement Learning

Feature: Learning with a critic

Application
• Optimization
• Function approximation

Elements of Reinforcement Learning

Agent
Environment
Policy
Reward function
Value function
Model of environment (optional)

Reinforcement Learning Problem

[Diagram: the agent-environment interaction loop. At step n the agent observes state x_n and reward r_n and takes action a_n; the environment responds with the next state x_n+1 and reward r_n+1.]
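To make the loop concrete, here is a minimal Python sketch of one episode of agent-environment interaction. The `env` and `agent` objects and their methods (`reset`, `step`, `act`, `observe`) are assumed interfaces introduced here for illustration, not part of the original presentation.

```python
# Minimal sketch of the agent-environment loop, under the assumed interfaces:
# env.reset() -> initial state, env.step(a) -> (next_state, reward, done),
# agent.act(x) -> action, agent.observe(...) -> learning update hook.
def run_episode(env, agent):
    x = env.reset()                     # initial state x_0
    total_reward = 0.0
    done = False
    while not done:
        a = agent.act(x)                # agent picks action a_n in state x_n
        x_next, r, done = env.step(a)   # environment returns reward and next state
        agent.observe(x, a, r, x_next)  # let the agent learn from the transition
        total_reward += r
        x = x_next
    return total_reward
```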

Markov Decision Process (MDP)

Definition:

A reinforcement learning task that satisfies the Markov property

Transition probabilities

$P^{a}_{xy} = \Pr\{\, x_{n+1} = y \mid x_n = x,\ a_n = a \,\}$

An Example of MDP

[Diagram: example MDP with two states, High and Low, and actions search, wait, and recharge.]
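As an illustration of the transition-probability notation, a two-state MDP like the one in the diagram can be stored as a simple table. The probability values below are invented for illustration and are not taken from the presentation.

```python
# Hypothetical transition table for a two-state example MDP.
# Keys are (state, action); values map next_state -> probability.
# All numeric probabilities are illustrative, not from the slides.
P = {
    ("High", "search"):   {"High": 0.7, "Low": 0.3},
    ("High", "wait"):     {"High": 1.0},
    ("Low",  "search"):   {"Low": 0.6, "High": 0.4},
    ("Low",  "wait"):     {"Low": 1.0},
    ("Low",  "recharge"): {"High": 1.0},
}

# P[(x, a)][y] plays the role of P^a_xy = Pr{x_{n+1} = y | x_n = x, a_n = a}.
assert abs(sum(P[("High", "search")].values()) - 1.0) < 1e-9
```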

Markov Decision Process (cont.)

Parameters

$x_n$: state
$a_n$: action
$r_n$: reward
$\gamma$: discount factor
$\pi$: policy

Value functions

$V^{\pi}(x) = E_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{n+k+1} \,\middle|\, x_n = x \right]$

$Q^{\pi}(x,a) = E_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{n+k+1} \,\middle|\, x_n = x,\ a_n = a \right]$
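The quantity inside these expectations is the discounted return. A short Python sketch, with an invented reward sequence, shows how it is computed:

```python
# Discounted return of a sample reward sequence r_{n+1}, r_{n+2}, ...
# The reward values in the example call are invented for illustration.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```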

Elementary Methods for the Reinforcement Learning Problem

Dynamic programming
Monte Carlo methods
Temporal-difference learning

Bellman’s Equations

$V^{*}(x) = \max_{a} E\!\left[\, r_{n+1} + \gamma V^{*}(x_{n+1}) \mid x_n = x,\ a_n = a \,\right]$

$Q^{*}(x,a) = E\!\left[\, r_{n+1} + \gamma \max_{a'} Q^{*}(x_{n+1}, a') \mid x_n = x,\ a_n = a \,\right]$

Dynamic Programming Methods

Policy evaluation

$V_{k+1}(x) = E_{\pi}\!\left[\, r_{n+1} + \gamma V_{k}(x_{n+1}) \mid x_n = x \,\right]$

Policy improvement

$Q^{\pi}(x,a) = E\!\left[\, r_{n+1} + \gamma V^{\pi}(x_{n+1}) \mid x_n = x,\ a_n = a \,\right]$

$\pi'(x) = \arg\max_{a} Q^{\pi}(x,a)$
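A minimal tabular sketch of iterative policy evaluation follows. The model structures (`P`, `R`, `policy`) are assumed dictionaries introduced for illustration; they are not defined in the presentation.

```python
# Iterative policy evaluation on a small tabular model (assumed structures):
# P[(x, a)] maps next_state -> probability, R[(x, a, y)] is the expected reward,
# policy[x] is the action the evaluated policy takes in state x.
def policy_evaluation(states, policy, P, R, gamma=0.9, tol=1e-6):
    V = {x: 0.0 for x in states}
    while True:
        delta = 0.0
        for x in states:
            a = policy[x]
            v_new = sum(p * (R[(x, a, y)] + gamma * V[y])
                        for y, p in P[(x, a)].items())
            delta = max(delta, abs(v_new - V[x]))
            V[x] = v_new
        if delta < tol:
            return V
```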

Dynamic Programming (cont.)

E ---- policy evaluation
I ---- policy improvement

Policy Iteration

$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^{*} \xrightarrow{E} V^{*}$

Value Iteration
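A sketch of that E/I alternation, reusing the `policy_evaluation` function above; `actions[x]`, listing the actions available in state x, is an assumed structure.

```python
# Policy iteration: alternate evaluation (E) and greedy improvement (I)
# until the policy stops changing.
def policy_iteration(states, actions, P, R, gamma=0.9):
    policy = {x: actions[x][0] for x in states}              # arbitrary initial policy
    while True:
        V = policy_evaluation(states, policy, P, R, gamma)   # E step
        stable = True
        for x in states:                                     # I step
            best = max(actions[x],
                       key=lambda a: sum(p * (R[(x, a, y)] + gamma * V[y])
                                          for y, p in P[(x, a)].items()))
            if best != policy[x]:
                policy[x] = best
                stable = False
        if stable:
            return policy, V
```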

Monte Carlo Methods

Feature
• Learning from experience
• Does not need complete transition probabilities

Idea
• Partition experience into episodes
• Average sample returns
• Update on an episode-by-episode basis (see the sketch below)
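A minimal sketch of first-visit Monte Carlo value estimation along these lines. `episodes` is an assumed list of trajectories collected from experience; the pairing of each state with the reward received on leaving it is also an assumption of this sketch.

```python
from collections import defaultdict

# First-visit Monte Carlo estimate of V(x): average the sample returns that
# follow the first visit to x in each episode. Each episode is a hypothetical
# list of (state, reward) pairs, where reward is received on leaving that state.
def mc_value_estimate(episodes, gamma=0.9):
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        first_visit_returns = {}
        # Walk the episode backwards, accumulating the discounted return.
        for x, r in reversed(episode):
            G = r + gamma * G
            first_visit_returns[x] = G        # the last assignment is the first visit
        for x, G in first_visit_returns.items():
            returns[x].append(G)              # one sample return per episode
    return {x: sum(gs) / len(gs) for x, gs in returns.items()}
```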

Temporal-Difference Learning

Features (combination of Monte Carlo and DP ideas)
• Learn from experience (Monte Carlo)
• Update estimates based in part on other learned estimates (DP)

The TD(λ) algorithm seamlessly integrates TD and Monte Carlo methods

TD(0) Learning

Initialize V(x) arbitrarily, π to the policy to be evaluated
Repeat (for each episode):
    Initialize x
    Repeat (for each step of episode):
        a ← action given by π for x
        Take action a; observe reward r and next state x'
        V(x) ← V(x) + α[ r + γ V(x') − V(x) ]
        x ← x'
    until x is terminal
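A direct Python rendering of this pseudocode, assuming the same hypothetical `env` interface used earlier and a `policy(x)` function:

```python
from collections import defaultdict

# Tabular TD(0): follow the given policy and move V(x) toward the one-step target.
def td0(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)                  # V(x) initialized arbitrarily (here: 0)
    for _ in range(num_episodes):
        x = env.reset()
        done = False
        while not done:
            a = policy(x)                   # action given by pi for x
            x_next, r, done = env.step(a)
            # V(x) <- V(x) + alpha * [r + gamma * V(x') - V(x)]
            V[x] += alpha * (r + gamma * V[x_next] - V[x])
            x = x_next
    return V
```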

Q-Learning

Initialize Q(x,a) arbitrarily
Repeat (for each episode):
    Initialize x
    Repeat (for each step of episode):
        Choose a from x using a policy derived from Q
        Take action a; observe r, x'
        Q(x,a) ← Q(x,a) + α[ r + γ max_a' Q(x',a') − Q(x,a) ]
        x ← x'
    until x is terminal
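The same algorithm as a Python sketch, again over the assumed `env` interface; the ε-greedy choice here stands in for "a policy derived from Q", and `env.actions(x)` is an assumed helper.

```python
import random
from collections import defaultdict

# Tabular Q-learning with an epsilon-greedy behaviour policy.
def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                  # Q(x, a) initialized arbitrarily (here: 0)
    for _ in range(num_episodes):
        x = env.reset()
        done = False
        while not done:
            acts = env.actions(x)
            if random.random() < epsilon:   # explore
                a = random.choice(acts)
            else:                           # exploit the current estimates
                a = max(acts, key=lambda a_: Q[(x, a_)])
            x_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(x_next, a_)] for a_ in env.actions(x_next))
            # Q(x,a) <- Q(x,a) + alpha * [r + gamma * max_a' Q(x',a') - Q(x,a)]
            Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])
            x = x_next
    return Q
```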

Q-Routing

Q_x(y,d) ---- estimated time that a packet would take to reach the destination node d from the current node x via x's neighbor node y

T_y(d) ---- y's estimate of the time remaining in the trip

q_y ---- queuing time in node y

T_xy ---- transmission time between x and y

$T_y(d) = \min_{z \in N(y)} Q_y(z, d)$

Algorithm of Q-Routing

1. Set initial Q-values for each node
2. Get the first packet from the packet queue of node x
3. Choose the best neighbor node $\hat{y} = \arg\min_{y \in N(x)} Q_x(y, d)$ and forward the packet to node $\hat{y}$
4. Get the estimated value $\hat{T}_{\hat{y}}(d)$ and the queuing time $q_{\hat{y}}$ from node $\hat{y}$
5. Update $Q_x(\hat{y}, d) \leftarrow Q_x(\hat{y}, d) + \eta \left[\, q_{\hat{y}} + T_{x\hat{y}} + \hat{T}_{\hat{y}}(d) - Q_x(\hat{y}, d) \,\right]$, where $\eta$ is the learning rate (a sketch of this update follows)
6. Go to 2.
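A Python sketch of one such routing step. The Q-table layout and the `neighbors`, `queue_time`, and `trans_time` helpers are assumptions made for illustration, not structures defined in the presentation.

```python
# One Q-routing step at node x for a packet destined to d.
# Q[x][y][d] ~ estimated delivery time from x to d via neighbor y.
def route_packet(Q, x, d, neighbors, queue_time, trans_time, eta=0.5):
    # Step 3: choose the neighbor with the smallest estimated delivery time.
    y = min(neighbors(x), key=lambda n: Q[x][n][d])
    # Step 4: y reports its best remaining-time estimate T_y(d) and its queuing delay q_y.
    T_y = 0.0 if y == d else min(Q[y][z][d] for z in neighbors(y))
    q_y = queue_time(y)
    # Step 5: move Q_x(y,d) toward the observed estimate.
    target = q_y + trans_time(x, y) + T_y
    Q[x][y][d] += eta * (target - Q[x][y][d])
    return y    # forward the packet to neighbor y
```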

Dual Reinforcement Q-Routing

[Diagram: a packet traveling from source s to destination d over the link x → y. Forward exploration updates Q_x(y,d) at the sending node x; backward exploration updates Q_y(x,s) at the receiving node y, using information carried by the packet.]
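A sketch of the dual update for a single hop x → y, extending the `route_packet` idea above. That the packet carries the estimate needed for the backward update is how dual reinforcement Q-routing is usually described, and the helper names are again hypothetical.

```python
# Dual reinforcement Q-routing on one hop x -> y of a packet traveling s -> d.
def dual_q_update(Q, x, y, s, d, neighbors, queue_time, trans_time, eta=0.5):
    # Forward exploration: x refines its estimate of reaching d via y.
    T_fwd = 0.0 if y == d else min(Q[y][z][d] for z in neighbors(y))
    Q[x][y][d] += eta * (queue_time(y) + trans_time(x, y) + T_fwd - Q[x][y][d])

    # Backward exploration: y refines its estimate of reaching the source s via x,
    # using the estimate carried along with the packet from x.
    T_bwd = 0.0 if x == s else min(Q[x][z][s] for z in neighbors(x))
    Q[y][x][s] += eta * (queue_time(x) + trans_time(y, x) + T_bwd - Q[y][x][s])
```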

Network Model

[Diagram: network model composed of two subnets, Subnet 1 and Subnet 2.]

Network Model (cont.)

[Diagram: the full topology, a grid of 36 nodes numbered 1 through 36.]

Node Model

[Diagram: node model consisting of a Packet Generator, a Packet Destroyer, a central Routing Controller, four Input Queues, and four Output Queues.]

Routing Controller

[Diagram: state machine of the Routing Controller, with Init and Idle states and transitions for Arrival, Departure, and Default events.]

Initialization / Termination Procedures

Initialization
• Initialize and/or register global variables
• Initialize the routing table

Termination
• Destroy the routing table
• Release memory

Arrival Procedure

Data packet arrival
• Update the routing table
• Route the packet using the control information, or destroy it if it has reached the destination

Control information packet arrival
• Update the routing table
• Destroy the packet

Departure Procedure

• Set all fields of the packet
• Get a shortest route
• Send the packet according to the route
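A minimal sketch of how these arrival and departure procedures might look as handlers in the simulated node. All structures here (the packet fields, routing table methods, and node helpers) are hypothetical and only illustrate the steps listed above.

```python
# Sketch of the routing controller's arrival and departure procedures.
# `node`, `packet`, and `routing_table` are hypothetical objects.
def on_arrival(node, packet, routing_table):
    routing_table.update(packet)              # both packet types update the routing table
    if packet.kind == "control":
        node.destroy(packet)                  # control packets are consumed after the update
    elif packet.dest == node.id:
        node.destroy(packet)                  # data packet has reached its destination
    else:
        node.forward(packet, routing_table)   # otherwise route the data packet onward

def on_departure(node, packet, routing_table):
    packet.src = node.id                      # set all fields of the packet
    route = routing_table.shortest_route(node.id, packet.dest)  # get a shortest route
    node.send(packet, route)                  # send the packet according to the route
```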
