UAV Route Planning in Delay Tolerant Networks

Daniel Henkel, Timothy X Brown

University of Colorado, Boulder

Infotech @ Aerospace ’07

May 8, 2007

Familiar: Dial-A-Ride

• Receive calls
• Pick up and drop off passengers
• Minimize overall transit time

The Bus

Dial-A-Ride: curb-to-curb, shared ride transportation service

Optimal route not trivial!

In context: Dial-A-UAV

• Sparsely distributed sensors, limited radios
• TSP solution not optimal
• Our approach: Queueing and MDP theory

[Figure: Sensor-1 through Sensor-6 and the Monitoring Station. Callout: Delay-tolerant traffic!]

Complication: infinite data at sensors; potentially two-way traffic

Talk tomorrow, 8 am: Sensor Data Collection

TSP’s Problem

• One cycle visits every node

• Problem: far-away nodes with little data to send
• Visit them less often

[Figure: Traveling Salesman Solution. UAV serving nodes A and B from the hub, with flow rates $f_A$, $f_B$ and distances $d_A$, $d_B$]

New: cycle defined by visit frequencies $p_i$ ($p_A$, $p_B$)

Queueing Approach

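Idea: express delay in terms of $p_i$, then minimize over the set $\{p_i\}$

• $p_i$ as probability distribution: $\sum_i p_i = 1$, $p_i \ge 0$
• Expected service time of any packet: $T = \sum_i p_i T_i$, with $T_i$ the service time of a visit to node $i$
• Inter-service time of node $i$: exponential distribution with mean $T / p_i$
• Weighted delay: $\sum_i \sum_j \frac{f_i \, p_j T_j}{F \, p_i}$, with $F = \sum_i f_i$ the total flow rate

Goal: Minimize average delay

[Figure: UAV serving nodes A, B, C, D from the hub, with flow rates $f_A$ to $f_D$, distances $d_A$ to $d_D$, and visit probabilities $p_A$ to $p_D$]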
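As a rough numerical check of the weighted-delay expression, the sketch below evaluates it for a given set of visit probabilities. It assumes that the mean inter-service time of node i is the expected per-visit service time divided by p_i, and that F is the sum of the flow rates. The function name `weighted_delay` and the two-node numbers are illustrative and do not come from the talk.

```python
def weighted_delay(p, f, T):
    """Flow-weighted mean inter-service time: sum_i (f_i / F) * (T_bar / p_i).

    p: visit probabilities, f: flow rates, T: per-visit service times.
    Assumes T_bar = sum_i p_i * T_i and F = sum_i f_i.
    """
    if abs(sum(p) - 1.0) > 1e-9 or min(p) <= 0:
        raise ValueError("p must be a strictly positive probability distribution")
    F = sum(f)
    T_bar = sum(pi * Ti for pi, Ti in zip(p, T))
    return sum(fi / F * T_bar / pi for fi, pi in zip(f, p))


# Hypothetical example: node 0 is close and busy, node 1 is far and quiet.
f = [4.0, 1.0]
T = [1.0, 4.0]
uniform = weighted_delay([0.5, 0.5], f, T)
skewed = weighted_delay([0.8, 0.2], f, T)
```

With these made-up numbers, visiting the busy nearby node more often lowers the weighted delay from 5.0 to 3.2.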

Solution and Algorithm

Probability of choosing node $i$ for next visit:

$$p_i = \frac{\sqrt{f_i / T_i}}{\sum_j \sqrt{f_j / T_j}}$$
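A minimal sketch of computing the visit probabilities, assuming the square-root rule in which p_i is proportional to sqrt(f_i / T_i). That rule is what minimizing the weighted delay over {p_i} yields when the expected service time is the p-weighted average of the per-node times. The function name and the sample inputs are hypothetical.

```python
import math


def optimal_visit_probabilities(f, T):
    """Return p_i proportional to sqrt(f_i / T_i), normalized to sum to 1."""
    weights = [math.sqrt(fi / Ti) for fi, Ti in zip(f, T)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical flows and per-visit service times for two nodes.
p = optimal_visit_probabilities([4.0, 1.0], [1.0, 4.0])
```

For these inputs the weights are 2 and 0.5, so the probabilities come out as 0.8 and 0.2.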

Implementation: deterministic algorithm
1. Set $c_i = 0$
2. $c_i = c_i + p_i$ while $\max\{c_i\} < 1$
3. $k = \arg\max\{c_i\}$
4. Visit node $k$; $c_k = c_k - 1$
5. Go to 2.
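The five steps can be sketched in Python as follows. The names `visit_schedule` and `n_visits` are chosen here for illustration, and ties in the argmax are broken by lowest index, which the slide does not specify.

```python
def visit_schedule(p, n_visits):
    """Turn visit probabilities p into a deterministic visit sequence."""
    c = [0.0] * len(p)                       # step 1: credit counters
    visits = []
    for _ in range(n_visits):
        while max(c) < 1.0:                  # step 2: accumulate credit
            c = [ci + pi for ci, pi in zip(c, p)]
        k = max(range(len(c)), key=c.__getitem__)  # step 3: argmax (first on ties)
        visits.append(k)                     # step 4: visit node k ...
        c[k] -= 1.0                          # ... and spend one unit of credit
    return visits                            # step 5 is the enclosing loop


schedule = visit_schedule([0.75, 0.25], 8)
```

In this example node 0 is visited three times for every visit to node 1, matching the 0.75 / 0.25 split without any randomness.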

Performance improvement over TSP!

Unknown Environment

• What is RL (reinforcement learning)?
  • Learning what to do without prior training
  • Given: high-level goal; NOT: how to reach it
  • Improving actions on the go
• Distinguishing features:
  • Interaction with environment
  • Trial & error search
  • Concept of rewards & punishments
• Example: training a dog

Learns model of environment.

The Framework

Agent
• Performs Actions

Environment
• Gives rise to Rewards
• Puts Agent in situations called States

Elements of RL

[Figure: Policy, Reward, Value, Model of Environment]

• Policy: what to do (depending on state)
• Reward: what is good
• Value: what is good because it predicts reward
• Model: what follows what

Source: Sutton, Barto, Reinforcement Learning – An Introduction, MIT Press, 1998

UA Path Planning - Simple

• Service traffic from A and B to hub H

• Goal: minimize average packet delay
• State: traffic waiting at nodes: $(t_A, t_B)$
• Actions: fly to A; fly to B
• Reward: # packets delivered
• Optimal policy: # visits to A and B; depends on flow rates, distances

[Figure: UAV serving nodes A and B from the hub, with flow rates $f_A$, $f_B$, distances $d_A$, $d_B$, and visit probabilities $p_A$, $p_B$]

Goal: Minimize average delay -> Find $p_A$ and $p_B$
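To make the state, action, and reward of this simple problem concrete, here is a toy environment step. It is a sketch under strong assumptions that are not in the slides: traffic arrives as a deterministic fluid at rate f, one action is a full round trip of duration T[node], and everything waiting at the visited node counts as delivered at the end of that trip. All names and numbers are hypothetical.

```python
def step(state, action, f, T):
    """Fly one round trip hub -> action -> hub.

    state:  dict node -> traffic waiting, e.g. {"A": t_A, "B": t_B}
    action: node to visit ("A" or "B")
    Returns (next_state, reward) with reward = amount of traffic delivered.
    """
    waiting = dict(state)
    dt = T[action]
    for node in waiting:                 # traffic keeps arriving everywhere
        waiting[node] += f[node] * dt
    delivered = waiting[action]          # assumption: pick up all waiting traffic
    waiting[action] = 0.0
    return waiting, delivered


f = {"A": 2.0, "B": 1.0}   # hypothetical flow rates
T = {"A": 1.0, "B": 3.0}   # hypothetical round-trip times
s1, r1 = step({"A": 0.0, "B": 0.0}, "A", f, T)
s2, r2 = step(s1, "B", f, T)
```

Flying to B after A delivers 4 units but lets 6 units pile up at A, which is the trade-off a policy over (t_A, t_B) has to balance.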

MDP

• If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
• If state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
  • state and action sets
  • one-step “dynamics” defined by transition probabilities:
    $$P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s,\ a_t = a\} \quad \text{for all } s, s' \in S,\ a \in A(s).$$
  • reward expectation:
    $$R^a_{ss'} = E\{r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s'\} \quad \text{for all } s, s' \in S,\ a \in A(s).$$
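One possible way to write down a finite MDP in code is a nested dictionary that stores, for each state and action, the list of (probability, next state, reward) outcomes. The layout, the helper names, and the two-state example below are assumptions made for illustration only.

```python
# P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.9, "s1", 1.0), (0.1, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)],
           "go":   [(1.0, "s0", 0.0)]},
}


def is_valid_mdp(P, tol=1e-9):
    """Check that the outcome probabilities of every (s, a) pair sum to 1."""
    return all(
        abs(sum(prob for prob, _, _ in outcomes) - 1.0) <= tol
        for actions in P.values()
        for outcomes in actions.values()
    )


def expected_reward(P, s, a):
    """Expected one-step reward for taking action a in state s."""
    return sum(prob * r for prob, _, r in P[s][a])


valid = is_valid_mdp(P)
r_go = expected_reward(P, "s0", "go")
```

Here the expected reward of "go" in "s0" is 0.9, the transition-probability-weighted average of the two possible rewards.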

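RL approach to solving MDPs

• Policy: mapping from set of states to set of actions, $\pi : S \to A$
• Sum of rewards (:= return $R_t$): from this time onwards
• Value function (of a state): expected return when starting with $s$ and following policy $\pi$. For an MDP,
  $$V^\pi(s) = E_\pi\{R_t \mid s_t = s\}$$

Bellman Equation for Policy π

• Evaluating $E\{\cdot\}$; assuming deterministic policy $\pi$; solution:
  $$V^\pi(s) = \sum_{s'} P^{\pi(s)}_{ss'} \left[ R^{\pi(s)}_{ss'} + \gamma V^\pi(s') \right]$$
  where $\gamma$ is the discount factor applied to future rewards
• Action-value function: value of taking action $a$ in state $s$. For an MDP,
  $$Q^\pi(s,a) = E_\pi\{R_t \mid s_t = s,\ a_t = a\}$$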
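The Bellman equation for a fixed deterministic policy can be solved by repeated substitution (iterative policy evaluation, as described in Sutton and Barto). The sketch below assumes a discounted return with factor gamma < 1 and an MDP stored as `P[s][a] = [(probability, next_state, reward), ...]`. Both the storage format and the two-state example are hypothetical.

```python
def evaluate_policy(P, policy, gamma, tol=1e-12):
    """Iteratively solve V(s) = sum_s' P * (R + gamma * V(s')) for a fixed policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            outcomes = P[s][policy[s]]       # deterministic policy: one action per state
            v_new = sum(prob * (r + gamma * V[s2]) for prob, s2, r in outcomes)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:                      # stop once a sweep changes nothing
            return V


# Hypothetical two-state loop: s0 -> s1 pays 1, s1 -> s0 pays 0.
P = {
    "s0": {"go": [(1.0, "s1", 1.0)]},
    "s1": {"go": [(1.0, "s0", 0.0)]},
}
V = evaluate_policy(P, {"s0": "go", "s1": "go"}, gamma=0.5)
```

Solving the two equations by hand gives V(s0) = 4/3 and V(s1) = 2/3, which the iteration reproduces.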

Optimality

• V and Q both have a partial ordering on them since they are real valued. π is also ordered:
  $$\pi \ge \pi' \text{ if and only if } V^{\pi}(s) \ge V^{\pi'}(s) \text{ for all } s \in S$$
• Concept of V* and Q*:
  $$V^*(s) = \max_\pi V^\pi(s) \text{ for all } s \in S$$
  $$Q^*(s,a) = \max_\pi Q^\pi(s,a) \text{ for all } s \in S \text{ and } a \in A(s)$$
• Concept of π*: the policy π which maximizes $Q^\pi(s,a)$ for all states $s$:
  $$\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s,a)$$
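When the transition probabilities and rewards are known, Q* can be approximated by Q-value iteration, a standard dynamic-programming technique, and π* then read off with the argmax. The slides do not say which method the authors use, so this is only an illustration. The discount factor gamma, the storage format `P[s][a] = [(probability, next_state, reward), ...]`, and the example MDP are all assumptions.

```python
def q_value_iteration(P, gamma, tol=1e-12):
    """Iterate Q(s,a) = sum_s' P * (R + gamma * max_a' Q(s',a')) to a fixed point."""
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                q_new = sum(prob * (r + gamma * max(Q[s2].values()))
                            for prob, s2, r in P[s][a])
                delta = max(delta, abs(q_new - Q[s][a]))
                Q[s][a] = q_new
        if delta < tol:
            return Q


def greedy_policy(Q):
    """pi*(s) = argmax_a Q*(s, a)."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}


# Hypothetical MDP: staying in s1 pays 2 per step, moving s0 -> s1 pays 1.
P = {
    "s0": {"a": [(1.0, "s0", 0.0)], "b": [(1.0, "s1", 1.0)]},
    "s1": {"a": [(1.0, "s1", 2.0)], "b": [(1.0, "s0", 0.0)]},
}
Q = q_value_iteration(P, gamma=0.5)
pi_star = greedy_policy(Q)
```

In this example the greedy policy moves from s0 to s1 and then stays there, giving V*(s0) = 3 and V*(s1) = 4.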

Reinforcement Learning - Methods

• To find π*, all methods try to evaluate V/Q value functions

• Different approaches:
  • Dynamic Programming approach
    • Policy evaluation, improvement, iteration
  • Monte-Carlo methods
    • Decisions are taken based on averaging sample returns
  • Temporal Difference methods (!!)
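As one example of a temporal-difference method, the sketch below shows the tabular Q-learning update from Sutton and Barto. The slides do not state which TD variant the authors use, so Q-learning is swapped in here purely for illustration. The step size `alpha`, the discount `gamma`, and the table contents are assumed values.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Move Q[s][a] toward the one-step target r + gamma * max_a' Q[s_next][a']."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]


# Hypothetical table: the agent took "b" in s0, received reward 1, landed in s1.
Q = {
    "s0": {"a": 0.0, "b": 0.0},
    "s1": {"a": 2.0, "b": 0.0},
}
new_value = q_learning_update(Q, "s0", "b", r=1.0, s_next="s1", alpha=0.5, gamma=0.5)
```

The target is 1 + 0.5 * 2 = 2, so with alpha = 0.5 the entry moves halfway from 0 to 2. Unlike the dynamic-programming approach, the update needs only the observed transition, not the transition probabilities.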
