
Motion Planning under Uncertainty for Robotic Tasks with Long Time Horizons

Hanna Kurniawati
Singapore–MIT Alliance for Research Technology, Singapore 117543, Singapore
[email protected]

Yanzhu Du
Department of Computer Science, Stanford University, Stanford, CA 94305, USA
[email protected]

David Hsu
Department of Computer Science, National University of Singapore, Singapore 117590, Singapore
[email protected]

Wee Sun Lee
Department of Computer Science, National University of Singapore, Singapore 117590, Singapore
[email protected]

Abstract

Motion planning with imperfect state information is a crucial capability for autonomous robots to operate reliably in uncertain and dynamic environments. Partially observable Markov decision processes (POMDPs) provide a principled general framework for planning under uncertainty. Using probabilistic sampling, point-based POMDP solvers have drastically improved the speed of POMDP planning, enabling us to handle moderately complex robotic tasks. However, robot motion planning tasks with long time horizons remain a severe obstacle for even the fastest point-based POMDP solvers today. This paper proposes Milestone Guided Sampling (MiGS), a new point-based POMDP solver, which exploits state space information to reduce effective planning horizons. MiGS samples a set of points, called milestones, from a robot's state space and constructs a simplified representation of the state space from the sampled milestones. It then uses this representation of the state space to guide sampling in the belief space and tries to capture the essential features of the belief space with a small number of sampled points. Preliminary results are very promising. We tested MiGS in simulation on several difficult POMDPs that model distinct robotic tasks with long time horizons in both 2-D and 3-D environments. These POMDPs are impossible to solve with the fastest point-based solvers today, but MiGS solved them in a few minutes.

1 Introduction

Motion planning with imperfect state information is a crucial capability for autonomous robots to operate reliably in uncertain and dynamic environments. With imperfect state information, a robot cannot decide the best actions on the basis of a single known state; instead, the best action depends on the set of all possible states consistent with the available information, resulting in much higher computational complexity for planning the best actions. Partially observable Markov decision processes (POMDPs) [9, 21] provide a principled general framework for such planning tasks. In a POMDP, we represent a set of possible states as a belief, which is a probability distribution over a robot's state space. We systematically reason over the belief space B, the space of all beliefs, by taking into account uncertainty in robot control, sensor measurements, and environment changes, in order to choose the best actions and achieve robust performance. By incorporating uncertainty into planning, the POMDP approach has led to improved performance in a number of robotic tasks, including localization, coastal navigation, grasping, and target tracking [5, 8, 14, 18].

Despite its solid mathematical foundation, POMDP planning faces two major computational challenges. The first one is the "curse of dimensionality": a complex robotic task generates a high-dimensional belief space. If a robotic task is modeled with a discrete state space, its belief space has dimensionality equal to the number of states. Thus, a task with 1,000 states has a 1,000-dimensional belief space! In recent years, point-based POMDP solvers [11, 14, 22, 23] have made dramatic progress in overcoming this challenge by sampling the belief space and computing approximate solutions. Today, the fastest point-based POMDP solvers, such as HSVI2 [22] and SARSOP [11], can handle moderately complex robotic tasks modeled as POMDPs with up to 100,000 states in reasonable time. The success of point-based solvers can be largely attributed to probabilistic sampling, which allows us to use a small number of sampled points as an approximate representation of a high-dimensional belief space. The approximate representation substantially reduces computational complexity. The same reason underlies the success of probabilistic sampling in other related problems and approaches, e.g., probabilistic roadmap (PRM) algorithms [2] for geometric motion planning (without uncertainty).

The second major challenge is the "curse of history". In a motion planning task, a robot often needs to take many actions to reach the goal, resulting in a long planning horizon. The complexity of planning often grows exponentially with the horizon. Together, a long planning horizon and a high-dimensional belief space compound the difficulty of planning under uncertainty. For this reason, even the best point-based POMDP solvers today have significant difficulty with robotic tasks requiring long horizons (see Section 6 for examples).

To overcome this second challenge and scale up POMDP solvers for realistic robot motion planning tasks, we have developed a new point-based POMDP solver called Milestone Guided Sampling (MiGS). It is known from earlier work on related problems that the most important component of a sampling-based planning algorithm is its sampling strategy [6]. MiGS reduces the planning horizon by constructing a more effective sampling strategy. It samples a set of points, called milestones, from a robot's state space and constructs a roadmap graph. Each node of the roadmap corresponds to a sampled milestone, and each edge is labeled with a sequence of actions that brings the robot from one milestone to another. MiGS then uses the roadmap to guide sampling in the belief space. Using the roadmap as a simplified representation of the state space, MiGS avoids exploring many of the similar belief space paths and thus improves computational efficiency.

MiGS' approach to belief space sampling is related to earlier POMDP solvers that use macro-actions [15, 24]. However, the earlier works define macro-actions as policies. They manually decompose the POMDP into several smaller POMDPs, solve the smaller POMDPs, and then use the resulting policies as macro-actions to solve the original POMDP. Instead of constraining the resulting policy in terms of such pre-defined macro-actions, MiGS automatically generates long action sequences, which can be regarded as open-loop policies, and uses them to guide sampling in the belief space. Only a small number of action sequences are initially allowed, restricting the resolution of sampling. The number of action sequences is increased as required, allowing the sampling resolution to scale up gradually with the complexity of the problem.

We tested MiGS in simulation on several difficult POMDPs that model distinct robotic tasks with long time horizons. These POMDPs are impossible to solve with the fastest point-based POMDP solvers today, but MiGS solved them in a few minutes.

2 Background

2.1 Motion Planning under Uncertainty

Despite its importance and more than three decades of active research [12, 25], motion planning under uncertainty remains a challenge in robotics. Several recent successful algorithms are based on the probabilistic sampling approach. Stochastic Motion Roadmap [1] combines PRM with the Markov decision process (MDP) framework to handle uncertainty in robot control, but it does not take into account uncertainty in sensing. Belief Roadmap [17] handles uncertainty in both robot control and sensing, but one major limitation is the assumption that the uncertainties can be modeled as Gaussian distributions. Unimodal distributions such as the Gaussian are inadequate when robots operate in complex geometric environments. Using a mixture of Gaussians can partially address this weakness, but at a significant computational cost.

POMDPs are a principled general framework that can overcome the above limitations. By tackling the difficulty of long planning horizons, MiGS brings POMDPs a step closer to being practical for complex robotic tasks.

2.2 POMDPs

A POMDP models an agent taking a sequence of actions under uncertainty to maximize its reward. Formally, it is specified as a tuple (S, A, O, T, Z, R, γ), where S is a set of states describing the agent and the environment, A is the set of actions that the agent may take, and O is the set of observations that the agent may receive. The remaining elements of the tuple are explained below.

In each time step, the agent lies in a state s ∈ S, takes an action a ∈ A, and moves from a start state s to an end state s′. Due to the uncertainty in action, the end state s′ is modeled as a conditional probability function T(s, a, s′) = p(s′|s, a), which gives the probability that the agent lies in s′, after taking action a in state s. The agent then receives an observation that provides information on its new state s′. Due to the uncertainty in observation, the observation result o ∈ O is again modeled as a conditional probability function Z(s′, a, o) = p(o|s′, a).

To elicit desirable agent behavior, we define a suitable reward function R(s, a). In each step, the agent receives a real-valued reward R(s, a), if it takes action a in state s. The goal of the agent is to maximize its expected total reward by choosing a suitable sequence of actions. When the sequence of actions has infinite length, we typically specify a discount factor γ ∈ (0, 1) so that the total reward is finite and the problem is well defined. In this case, the expected total reward is given by E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) ], where s_t and a_t denote the agent's state and action at time t.

POMDP planning means computing an optimal policy that maximizes the agent's expected total reward. In the more familiar case where the agent's state is fully observable, a policy prescribes an action, given the agent's current state. However, a POMDP agent's state is partially observable and not known exactly. So we rely on the concept of a belief. A POMDP policy π : B → A is a mapping from B to A, which prescribes an action a, given the agent's belief b.

A policy π induces a value function V_π(b). The value function V_π(b) = E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) | b, π ] specifies the expected total reward of executing policy π: the agent has initial belief b, and its action a_t at time t is chosen according to π. It is known that V*, the value function associated with an optimal policy π*, can be approximated arbitrarily closely by a convex, piecewise-linear function,

V(b) = max_{α∈Γ} (α · b) = max_{α∈Γ} Σ_{s∈S} α(s) b(s),    (1)

where Γ is a finite set of vectors called α-vectors and b is a belief. Each α-vector is associated with an action a. The policy can be executed by selecting the action corresponding to the best α-vector at the current belief. So a policy can be represented as a set Γ of α-vectors. For each α ∈ Γ, α(s) is the expected total reward that the agent receives when it starts in s, takes the action a associated with α, and acts according to the policy defined by Γ afterwards. Policy computation, which, in this case, involves the construction of Γ, is usually performed offline.

Given a policy, represented as a set Γ of α-vectors, the control of the agent's actions, also called policy execution, is performed online in real time. It consists of two steps executed repeatedly. The first step is action selection. If the agent's current belief is b, it finds the action a that maximizes V(b) by evaluating (1). The second step is belief update. After the agent takes an action a and receives an observation o, its new belief b′ is given by

b′(s′) = τ(b, a, o) = η Z(s′, a, o) Σ_{s} T(s, a, s′) b(s),    (2)

where η is a normalization constant.

More information on POMDPs is available in [9, 25].
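The two execution steps can be made concrete with a few lines of code. The sketch below implements the belief update of equation (2) and the α-vector action selection of equation (1) for a small discrete POMDP; the array layout (T[s, a, s′], Z[s′, a, o]) and the pairing of each α-vector with an action are our illustrative assumptions, not an implementation from the paper.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Equation (2): b'(s') = eta * Z(s', a, o) * sum_s T(s, a, s') * b(s)."""
    # b: (|S|,); T: (|S|, |A|, |S|) indexed [s, a, s']; Z: (|S|, |A|, |O|) indexed [s', a, o].
    unnormalized = Z[:, a, o] * (b @ T[:, a, :])
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation o has zero probability under belief b and action a")
    return unnormalized / total

def select_action(b, alphas, alpha_actions):
    """Equation (1): return the action of the alpha-vector maximizing alpha . b."""
    # alphas: (|Gamma|, |S|); alpha_actions[i] is the action associated with alphas[i].
    return alpha_actions[int(np.argmax(alphas @ b))]
```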


2.3 Related POMDP Solvers

Large state spaces and long time horizons, typical of many robot motion planning tasks, are two main obstacles that hinder the use of POMDPs in practice. Many approaches have been proposed to alleviate these difficulties.

Point-based POMDP solvers are currently one of the most successful approaches. Point-based solvers reduce the complexity of planning in the belief space by representing B as a set of sampled beliefs. Interestingly, probabilistic sampling, a key idea of point-based solvers, is also what makes PRM planning successful in geometric motion planning. For computational efficiency, most of the recent point-based solvers [11, 14, 20, 22, 23] sample only from a subset R ⊆ B reachable from a given initial belief. They differ in how they sample R. PBVI [14] and Perseus [23] spread sampled points over R to cover it well. HSVI2 [22] maintains upper and lower bounds on the optimal value function and samples R to close the gap between the upper and lower bounds. FSVI [20] uses only the upper bound to guide sampling. SARSOP [11] biases sampling towards optimally reachable spaces, the subset of beliefs reachable under optimal policies. Point-based solvers have made impressive progress in computing approximate solutions for POMDPs with large state spaces, and they have been successfully applied to a variety of non-trivial robotic tasks, including localization [16, 18], coastal navigation [16, 18], grasping [5], target tracking [14, 8], and exploration [11, 22]. Despite the progress, even the best point-based POMDP solvers today have significant difficulty with robotic tasks that require long planning horizons.

MiGS follows the approach of point-based POMDP solvers, but aims at overcoming the difficulty of long planning horizons. It uses sequences of actions, rather than single primitive actions, to explore the belief space and thus significantly reduces effective planning horizons.

3 Milestone Guided Sampling

A key idea of point-based POMDP solvers is to sample a set of points from B and use it as an approximate representation of B. Let R ⊆ B be the set of points reachable from a given initial belief point b0 ∈ B under arbitrary sequences of actions and observations. Most of the recent point-based POMDP algorithms sample from R instead of B for computational efficiency. The sampled points form a belief tree TR (Figure 1). Each node of TR represents a sampled point b ∈ B. The root of TR is the initial belief point b0. To sample a new point b′, we pick a node b from TR as well as an action a ∈ A and an observation o ∈ O according to suitable probability distributions or heuristics. We then compute b′ = τ(b, a, o) using (2) and insert b′ into TR as a child of b. If a POMDP requires an effective planning horizon of h actions and observations, TR may contain Θ((|A||O|)^h) nodes in the worst case, where |A| is the number of actions and |O| is the number of observations for the POMDP. Thus any point-based solver trying to construct TR exhaustively must have running time exponential in h and suffer from the "curse of history".

Figure 1: The belief tree rooted at b0.

To overcome this difficulty, let us consider the space from which we must sample. Suppose that the effective planning horizon is h. We must sample from a subset of R that contains Rh, the set of belief points reachable from b0 with at most h actions and observations. Our difficulty is that the size of Rh grows exponentially with h. The basic idea of MiGS is to sample Rh hierarchically at multiple resolutions and avoid exhaustively sampling Rh unless necessary.

To do so, MiGS builds a roadmap graph M in a robot's state space S. The nodes of M are states sampled from S and are called milestones. An edge e between two milestones s and s′ of M is annotated with a sequence of actions (a1, a2, ..., aℓ) that bring the robot from s to s′. The edge e is also annotated with a sequence of states (s0, s1, ..., sℓ) that the robot traverses under the actions (a1, a2, ..., aℓ), with s0 = s and sℓ = s′. A path in M encodes a long action sequence obtained by concatenating the action sequences associated with the edges along the path. By traversing M suitably, we can generate a rich collection of action sequences for sampling B. At a node b of TR, we can simply apply a sequence of actions associated with an edge of M, instead of a single action, and derive a child node b′.
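Since both the action sequence and the state sequence attached to an edge are used later (the state sequence is what the observation-sequence sampling described below consults), a roadmap edge has to store both. A minimal data-structure sketch is shown below; the class and field names are ours, and states and actions are assumed to be plain integer indices.

```python
from dataclasses import dataclass, field

@dataclass
class RoadmapEdge:
    source: int            # milestone s
    target: int            # milestone s'
    actions: list[int]     # (a_1, ..., a_l): actions driving the robot from s to s'
    states: list[int]      # (s_0, ..., s_l) with states[0] == source and states[-1] == target
    weight: float          # weight of the shortest state-graph path used to create this edge

@dataclass
class Roadmap:
    milestones: set[int] = field(default_factory=set)
    out_edges: dict[int, list[RoadmapEdge]] = field(default_factory=dict)

    def add_edge(self, e: RoadmapEdge) -> None:
        self.milestones.update((e.source, e.target))
        # Keep outgoing edges sorted by weight; belief space sampling tries low-weight edges first.
        self.out_edges.setdefault(e.source, []).append(e)
        self.out_edges[e.source].sort(key=lambda edge: edge.weight)
```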

The resulting belief tree TR then becomes much smaller. Suppose, for example, that M has maximum degree d and the minimum length of an action sequence contained in each edge of M is ℓ. Then, for a POMDP with time horizon h, the number of nodes in TR is O((d|O|^ℓ)^{h/ℓ}) = O(d^{h/ℓ} |O|^h) in the worst case. This indicates that the action sequences encoded in M help in reducing the effect of long planning horizons due to actions. Since the size of TR grows exponentially with h, the improvement is significant. MiGS uses an additional technique for reducing the long-horizon effect due to observations, but it is secondary and will be described later.
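A quick back-of-the-envelope comparison illustrates the size of the reduction. The numbers below are made up for illustration and do not come from the paper.

```python
# Hypothetical problem parameters, chosen only to illustrate the bound.
A, O, h = 5, 10, 30      # number of actions, number of observations, planning horizon
d, l = 5, 10             # roadmap degree, minimum actions per roadmap edge

plain_tree = (A * O) ** h              # Theta((|A||O|)^h) nodes without the roadmap
migs_tree = (d * O ** l) ** (h // l)   # O((d |O|^l)^(h/l)) nodes with roadmap-guided sampling

print(f"plain belief tree: ~{plain_tree:.2e} nodes")   # ~9.31e+50
print(f"MiGS belief tree : ~{migs_tree:.2e} nodes")    # ~1.25e+32
```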

Now it should be clear that MiGS is potentially faster, as the belief tree TR is smaller. However, a more fundamental question remains. The roadmap M is a simplified representation of the state space S. In general, M contains only a subset of sampled states and not all states in S. Do the belief points sampled with the help of M cover the reachable belief space well, and are they likely to lead to a good approximation of the optimal value function and the optimal policy? The answer is yes, if we sample S adequately in a sense which we now explain.

Our main intuition is that the underlying state space of a robotic system is typically continuous, and the POMDP state space S is a discretization of it. States that are close together should therefore have similar properties. Formally, denote by Γ* a set of α-vectors representing the optimal value function. Given a constant ε > 0, we partition the state space S into a collection Kε of disjoint subsets so that for any α ∈ Γ* and any two states s and s′ in the same subset K ∈ Kε, we have |α(s) − α(s′)| ≤ ε. Intuitively, the partitioning condition means that any two states in the same subset K are similar in terms of their significance in the optimal value function. The constant ε controls the resolution of partitioning. We sometimes omit the subscript in Kε to simplify the notation. The partitioning K induces a distance metric on the belief space B:

Definition 1. Let K be an ε-partitioning of the state space S. The distance between any two beliefs b and b′ in B with respect to K is

d_K(b, b′) = Σ_{K∈K} | Σ_{s∈K} b(s) − Σ_{s∈K} b′(s) |.    (3)

This new metric is weaker than the usual L1 metric and is upper-bounded by the L1 metric. It measures the difference in probability mass for subsets of states rather than individual states. This is desirable, because the states within a single subset K ∈ K are similar under our assumption and there is no need to distinguish them. Now we can derive a Lipschitz condition on the optimal value function V*(b) under d_K.
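A direct implementation of the metric in equation (3) is short once the partition is represented as a label array mapping each state to its subset; the label-array representation is our choice for illustration.

```python
import numpy as np

def d_K(b1, b2, labels):
    """Equation (3): sum over partitions K of |sum_{s in K} b1(s) - sum_{s in K} b2(s)|."""
    # labels[s] is the index of the partition K containing state s.
    n_parts = int(labels.max()) + 1
    mass1 = np.bincount(labels, weights=b1, minlength=n_parts)   # per-partition mass of b1
    mass2 = np.bincount(labels, weights=b2, minlength=n_parts)   # per-partition mass of b2
    return float(np.abs(mass1 - mass2).sum())
```

Because the probability mass is aggregated per partition before taking absolute differences, the result never exceeds the L1 distance Σ_s |b1(s) − b2(s)|, which matches the remark above.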

Theorem 1. Let K be an ε-partitioning of the state space S, and let B be the corresponding belief space over S. For any b, b′ ∈ B, if d_K(b, b′) ≤ δ, then |V*(b) − V*(b′)| ≤ (Rmax/(1 − γ)) δ + 2ε, where Rmax = max_{s∈S, a∈A} |R(s, a)|.

The proof is given in the appendix. Theorem 1 provides a sampling criterion for approximating V* well. Suppose that we sample a set of beliefs that covers B in the sense that for any b ∈ B, there is a sampled point b′ with d_K(b, b′) ≤ δ. Then Theorem 1 implies that we can approximate V*(b) for any b ∈ B by V*(b′) for some sampled b′ with a small error that depends on δ. What is even more interesting is that the distance function d_K(b, b′) relaxes the covering requirement by averaging out variations of b(s) and b′(s) within the same partition K ∈ K, at the cost of introducing an additional error ε in the value function approximation. This suggests that it is probably sufficient to have only one representative state, a milestone, from each partition K ∈ K. These milestones can be used to construct an approximate partitioning, as done by MiGS during roadmap construction.

Of course, MiGS does not know the partitioning K in advance. One simple way of ensuring one milestone in each partition K ∈ K is to sample a large number of milestones from S uniformly at random. If each K ∈ K is sufficiently large in size, then we can guarantee that uniform sampling generates at least one milestone from each K ∈ K with high probability. To improve efficiency, observe that under an optimal policy, the robot likely goes through states that provide high rewards or good observations. So we bias the sampling towards such states. See Section 4 for details.

We now give a sketch of the overall algorithm. Suppose that we are given as input a POMDP model (S, A, O, T, Z, R, γ). In addition, we are given a set of goal states Sg ⊆ S, which is common for robotic tasks. Each goal state g ∈ Sg is self-absorbing with T(g, a, g) = 1 for all a ∈ A, which indicates that once a robot reaches a goal state, the task ends. The robot receives zero reward in a goal state and incurs a negative cost for actions in all other states. Formally, if s ∈ Sg, R(s, a) = 0 for all a ∈ A; otherwise, R(s, a) ≤ 0. Such a reward function motivates the robot to reach the goal as efficiently as possible.

MiGS iterates over two stages. In the first stage, we sample a set of new milestones from S and use it to construct and refine a roadmap M. In the second stage, we use M to guide sampling in B. Following the approach of point-based POMDP planning, we perform value iteration [19] on a set Γ of α-vectors, which represents a piecewise-linear lower-bound approximation to the optimal value function V*. Exploiting the fact that V* must satisfy the Bellman equation, value iteration starts with an initial approximation to V* and performs backup operations on the approximation by iterating on the Bellman equation until the iteration converges.

What is different in value iteration for point-based POMDP planning is that backup operations are performed only at a set of sampled points from B rather than the entire B. We sample this set from B by constructing a belief tree TR rooted at an initial belief point b0. To add a new node to TR, we first choose an existing node b in TR from less densely sampled regions of B, as this likely leads to sampled points that cover B well. We then choose a suitable edge e from the roadmap M and use the associated action sequence (a1, a2, ..., aℓ) and state sequence (s0, s1, s2, ..., sℓ) to generate a new node.

Specifically, we first sample an observation sequence (o1, o2, ..., oℓ) consistent with the state sequence (s0, s1, ..., sℓ) in the sense that Z(si, ai, oi) = p(oi|si, ai) > 0, for 1 ≤ i ≤ ℓ. The consistency requirement limits the number of possible observations. The main motivation for sampling observation sequences is similar to that of using action sequences contained in M. Often, many observation sequences provide similar information. Sampling observation sequences avoids exploring many similar belief space paths and helps to reduce the long-horizon effect due to observations.

After obtaining the observation sequence, we then apply the action-observation sequence (a1, o1, a2, o2, ..., aℓ, oℓ) to b and generate a sequence of new beliefs (b1, b2, ..., bℓ), where

b1 = τ(b, a1, o1),    (4)
bi = τ(bi−1, ai, oi) for 2 ≤ i ≤ ℓ.    (5)

Finally, bℓ is inserted into TR as a new child node of b, while (b1, b2, ..., bℓ−1) is associated with the edge from b to bℓ for backup operations. After creating the node bℓ, we perform backup operations for every belief associated with the nodes and edges of TR along the path from bℓ to the root b0. A backup operation at a node b improves the approximation of V*(b) by looking ahead one step. We perform the standard α-vector backup (Algorithm 1). Each backup operation creates a new α-vector, which is added to Γ to improve the approximation of V*.
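The sketch below strings these steps together for a single roadmap edge: it samples an observation sequence satisfying the consistency requirement Z(si, ai, oi) > 0 and then rolls the belief forward with repeated applications of τ, reusing belief_update and RoadmapEdge from the earlier sketches. Choosing uniformly among the consistent observations is our simplification; the paper only requires positive probability.

```python
import numpy as np

def sample_consistent_observations(states, actions, Z, rng):
    """Pick o_i with Z(s_i, a_i, o_i) > 0 for each step of the edge's state sequence."""
    observations = []
    for s_i, a_i in zip(states[1:], actions):          # s_i is the state reached after action a_i
        support = np.flatnonzero(Z[s_i, a_i, :] > 0)
        observations.append(int(rng.choice(support)))  # uniform over consistent observations
    return observations

def expand_with_edge(b, edge, T, Z, rng):
    """Equations (4)-(5): generate the belief sequence (b_1, ..., b_l) along one roadmap edge."""
    observations = sample_consistent_observations(edge.states, edge.actions, Z, rng)
    beliefs, current = [], b
    for a_i, o_i in zip(edge.actions, observations):
        current = belief_update(current, a_i, o_i, T, Z)   # tau(b_{i-1}, a_i, o_i)
        beliefs.append(current)
    return beliefs   # beliefs[-1] becomes the new child node; the rest annotate the tree edge
```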

After a round of backup operations, we check the improvement in value function approximation at the root node b0. If the value no longer changes after several rounds of backup, it likely indicates that all the information in the current roadmap has been exploited and a more refined one is necessary. We then repeat the two stages of MiGS to refine M and improve the value function approximation by sampling additional new beliefs. The details for roadmap construction and belief space sampling are described in Sections 4 and 5, respectively.

4 Roadmap Construction

To construct a roadmap M, MiGS samples a set of points from the state space S. They become the milestones, i.e., the nodes of M. The set Q of milestones induces a partitioning KQ of S into disjoint subsets. The adjacency relationship between the partitions KQ provides the edge connectivity of M. The partitioning KQ is an approximation of Kε, which is unknown in advance. To improve the approximation, MiGS incrementally refines KQ by adding more and more milestones to Q.

4.1 Sampling Milestones

Ultimately we would like to sample a set Q of milestones so that each partition in Kε contains a milestone. For efficiency, we bias the sampling towards states that provide higher rewards or informative observations. Specifically, we sample a point s from S with probability

p(s) ∝ e^{λ h(s)}.

Algorithm 1  Perform α-vector backup at a belief b.
BACKUP(b, Γ)
1: For all a ∈ A, o ∈ O: α_{a,o} ← argmax_{α∈Γ} (α · τ(b, a, o)).
2: For all a ∈ A, s ∈ S: α_a(s) ← R(s, a) + γ Σ_{o,s′} T(s, a, s′) Z(s′, a, o) α_{a,o}(s′).
3: α′ ← argmax_{a∈A} (α_a · b).
4: Insert α′ into Γ.
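A NumPy transcription of Algorithm 1, using the same array conventions as the earlier sketches (T[s, a, s′], Z[s′, a, o], R[s, a]); this is our own vectorized rendering, not the authors' code.

```python
import numpy as np

def backup(b, Gamma, T, Z, R, gamma):
    """Algorithm 1: compute a new alpha-vector at belief b and the action associated with it."""
    n_states, n_actions, _ = T.shape
    n_obs = Z.shape[2]
    alpha_a = np.array(R.T, dtype=float)            # line 2 starts from R(s, a); shape (|A|, |S|)
    for a in range(n_actions):
        for o in range(n_obs):
            # Line 1: alpha_{a,o} <- argmax_{alpha in Gamma} alpha . tau(b, a, o)
            next_b = Z[:, a, o] * (b @ T[:, a, :])  # unnormalized tau(b, a, o); argmax is unaffected
            best = Gamma[np.argmax(Gamma @ next_b)]
            # Line 2: add gamma * sum_{s'} T(s, a, s') Z(s', a, o) alpha_{a,o}(s')
            alpha_a[a] += gamma * (T[:, a, :] @ (Z[:, a, o] * best))
    a_star = int(np.argmax(alpha_a @ b))            # line 3: best action at b
    return alpha_a[a_star], a_star                  # line 4: caller appends the new vector to Gamma
```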

In this sampling distribution, λ is a pre-specified positive constant that controls the smoothness of the distribution, and h(s) indicates the importance of a state s. We define the heuristic importance function as

h(s) = (αh(s) + Rmax) bh(s),

where Rmax = max_{s∈S, a∈A} |R(s, a)|. If we ignore the Rmax term for the moment, h(s) becomes simply a product αh(s) bh(s). This is motivated by the form of the optimal value function V*(b) = max_{α∈Γ*} Σ_{s∈S} α(s) b(s), which indicates that under an optimal policy, the robot likely goes through states with large α(s)b(s) values. The term αh(s) in h(s) tries to capture those states with high expected total reward. Suppose that the robot starts in s and receives the immediate reward R(s, a) for taking action a in s. We assume that all actions occur with equal probability and that the robot receives the maximum reward, which is 0 by our assumption, for all future actions. This gives an optimistic estimate of the expected total reward that the robot can actually achieve, starting from s:

αh(s) = Σ_{a∈A} p(a|s) R(s, a) = (1/|A|) Σ_{a∈A} R(s, a).

Since R(s, a) ≤ 0 by our assumption, we need to insert the additional term Rmax in h(s) so that αh(s) + Rmax ≥ 0 for all s ∈ S and a large positive value of h(s) indicates an important state. The second component bh(s) tries to capture those states with informative observations. We identify such states by considering the probability p(s|a, o). A large p(s|a, o) value implies that if the robot arrives in s by taking action a and receives observation o, then it recognizes the state s with high probability. We assume again that all actions occur with equal probability, but that given an action a, the observations occur according to the observation function Z(s, a, o). We then define bh(s) as the weighted average of p(s|a, o) over all actions and observations:

bh(s) = Σ_{a∈A} p(a|s) Σ_{o∈O} Z(s, a, o) p(s|a, o) = (1/|A|) Σ_{a∈A} Σ_{o∈O} Z(s, a, o) p(s|a, o).

To calculate p(s|a, o), we apply Bayes' rule. Assuming that the prior probabilities of all states given a are equal, we get p(s|a, o) = p(o|s, a)/Σ_{s′∈S} p(o|s′, a) = Z(s, a, o)/Σ_{s′∈S} Z(s′, a, o).

We choose to define the heuristic importance function h(s) using local information only. Other more sophisticated definitions are possible, but we would like to keep h(s) as simple as possible.
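With uniform p(a|s) = 1/|A| and the array conventions used earlier, the whole heuristic reduces to a few lines of NumPy. The sketch below is our reading of the formulas above; Z is indexed as Z[s, a, o] for the observation received on arriving in s after action a.

```python
import numpy as np

def milestone_distribution(Z, R, lam=1.0):
    """p(s) proportional to exp(lam * h(s)) with h(s) = (alpha_h(s) + R_max) * b_h(s)."""
    R_max = np.abs(R).max()
    alpha_h = R.mean(axis=1)                              # (1/|A|) sum_a R(s, a)
    # p(s | a, o) under a uniform prior over states: Z(s, a, o) normalized over s.
    denom = Z.sum(axis=0, keepdims=True)
    p_s_given_ao = np.divide(Z, denom, out=np.zeros_like(Z, dtype=float), where=denom > 0)
    b_h = (Z * p_s_given_ao).sum(axis=2).mean(axis=1)     # (1/|A|) sum_a sum_o Z(s,a,o) p(s|a,o)
    weights = np.exp(lam * (alpha_h + R_max) * b_h)
    return weights / weights.sum()

# Usage sketch: p = milestone_distribution(Z, R); milestones = rng.choice(len(p), size=k, replace=False, p=p)
```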

4.2 State Space Partitioning

The milestone set Q induces a partitioning KQ of S. Each milestone q ∈ Q spans a partition K ∈ KQ such that each state in K is "closer" to q than to any other milestone in Q. This is similar to a Voronoi diagram for a point set under the Euclidean distance. However, Euclidean distance is unsuitable here. The proximity measure should reflect how the states differ in terms of their importance in the optimal value function.

MiGS captures the proximity between states in a state graph. A state graph P is a weighted, directed multi-graph. The nodes of P are all the states in S. There is an edge (s, s′, a) between two nodes s and s′ and labeled with a, whenever T(s, a, s′) > 0. The edge weight w(s, s′, a) tries to capture the difference in expected total reward between s and s′. If the robot is in s, it can go towards the goal directly, or it can take action a to reach s′ and then go towards the goal from s′. In the latter case, after taking action a, the robot may end in a state s′′ ≠ s′, due to uncertainty in robot control. So we must take into account the difference in expected total reward between s′ and s′′. We define w(s, s′, a) by calculating

αw(s) − αw(s′) = R(s, a) + γ Σ_{s′′∈S\{s′}} T(s, a, s′′) Σ_{o∈O} Z(s′′, a, o) (αw(s′′) − αw(s′)),    (6)

where αw(s) approximates the robot's expected total reward from a state s ∈ S. The first term R(s, a) in (6) is the immediate reward of taking action a in s. The second term accounts for the outcomes after taking the action. We can write one such equation for each unknown αw(s) and solve the resulting linear system of equations using standard numerical methods. However, this could be computationally expensive, if the state space S is large. We thus make a further approximation. Since s′′ is reachable from s with a single action, we expect αw(s) and αw(s′′) to have similar values if the reward function R(s, a) is smooth. Now replacing αw(s′′) with αw(s) in the right hand side of (6) and after some simple algebra, we get αw(s) − αw(s′) = R(s, a)/(1 − γ + γT(s, a, s′)). Finally, since R(s, a) ≤ 0 by our assumption, we define

w(s, s′, a) = −R(s, a) / (1 − γ + γT(s, a, s′)),

so that all edge weights are non-negative, and say that a path is short if it has low weight.

The state graph P defines a proximity measure dP, where dP(s, s′) is the minimum weight over all paths from s to s′. Although dP is not necessarily symmetric, it can still be used to partition S. Specifically, the partition K ∈ KQ for a milestone q ∈ Q is a subset of states such that each state s in K has dP(s, q) ≤ dP(s, q′) for all q′ ∈ Q. Such a partitioning KQ is in fact an inward Voronoi diagram of Q over the state space S [3].
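Concretely, both the state graph and the inward Voronoi partition can be computed with a single multi-source Dijkstra search on the reversed graph, since dP is directed and we need shortest paths from every state to its nearest milestone. The sketch below assumes dense T and R arrays as before; ties between equally close milestones are broken arbitrarily.

```python
import heapq
import numpy as np

def build_state_graph(T, R, gamma):
    """Edges (s, s', a) with weight -R(s, a) / (1 - gamma + gamma * T(s, a, s')) whenever T(s, a, s') > 0."""
    n_states, n_actions, _ = T.shape
    edges = []
    for s in range(n_states):
        for a in range(n_actions):
            for s_next in np.flatnonzero(T[s, a, :] > 0):
                w = -R[s, a] / (1.0 - gamma + gamma * T[s, a, s_next])
                edges.append((s, int(s_next), a, w))
    return edges

def inward_voronoi(edges, milestones, n_states):
    """Assign each state s to the milestone q minimizing d_P(s, q), the weight of a shortest path from s to q."""
    reverse_adj = [[] for _ in range(n_states)]
    for s, s_next, _, w in edges:
        reverse_adj[s_next].append((s, w))     # reversed edge, so Dijkstra grows paths *into* the milestones
    dist = np.full(n_states, np.inf)
    owner = np.full(n_states, -1)
    heap = []
    for q in milestones:
        dist[q], owner[q] = 0.0, q
        heapq.heappush(heap, (0.0, q, q))
    while heap:
        d, s, q = heapq.heappop(heap)
        if d > dist[s]:
            continue
        for s_prev, w in reverse_adj[s]:
            if d + w < dist[s_prev]:
                dist[s_prev], owner[s_prev] = d + w, q
                heapq.heappush(heap, (d + w, s_prev, q))
    return owner    # owner[s] is the milestone whose partition contains state s
```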

4.3 Connecting Milestones

To complete the construction of roadmap M, we need to insert edges between milestones and label them with action and state sequences. For this, we use the adjacency relationship between the partitions in KQ. Let q and q′ be two milestones in Q. Let K and K′ be the corresponding partitions in KQ for q and q′, respectively. The states in K ∪ K′ induce a subgraph PKK′ of P. We insert an edge (q, q′) from q to q′ into M if there is a path from q to q′ in PKK′. One may allow self-loops in the roadmap M if the POMDP model contains pure information-gathering actions, which change a robot's belief, but not its state. We then find a shortest path from q to q′ in PKK′ and use the edges along this path to label (q, q′) with action and state sequences accordingly. The edge (q, q′) is also assigned a weight, which is the weight of the shortest path.

4.4 Roadmap Refinement

To refine a roadmap M, we sample an additional set Q′ of milestones. We then add Q′ to M and construct a new roadmap over Q ∪ Q′ incrementally, where Q denotes the milestones in M.

In a discrete POMDP, S is finite. So we can add in all states in S as milestones. The resulting roadmap M would still be insufficient for generating all possible sequences of actions, as M does not contain all the actions between a pair of states. As a final refinement, we can use the state graph P as a roadmap. This in principle makes planning complete, but is probably not necessary or desirable in practice.

5 Belief Space Sampling

MiGS samples from B by incrementally constructing a belief tree TR rooted at an initial belief point b0 (see Section 3). Adding a new node to TR consists of two steps. In the first step, we choose an existing node b from TR. In the second step, we use the roadmap M as a guide to choose an action and an observation sequence, apply the action and the observation sequence to b, and generate a new node b′ as a child of b.

We want to spread the newly sampled points evenly over B in order to improve the coverage of B. A similar strategy is employed by PBVI [14], one of the early point-based POMDP solvers. However, we cover B under the distance dK, which provides a more relaxed covering requirement and makes it easier to cover B. According to Theorem 1, this is sufficient for obtaining a good approximation of the optimal value function. We implement this idea by assigning a weight wc(b) to each node b in TR. The weight wc(b) is the number of beliefs in TR within a small pre-specified distance of b according to dK. It gives an estimate of how densely the small region near b is sampled. We then choose an existing node b from TR with probability proportional to 1/wc(b), in order to improve coverage in the less densely sampled regions.

To generate a new node b′, we use M as a guide to choose an action sequence and an observation sequence first. The roadmap M contains many paths. Each path represents a long action sequence obtained by concatenating the action sequences associated with the edges along the path. If these action sequences are optimal, we can simulate these actions and generate a set of beliefs on the optimally reachable space R* ⊆ B, which contains the subset of beliefs reachable from b0 under an optimal policy. It is known that given a sufficiently large set of beliefs forming a cover of R*, we can compute the corresponding optimal policy efficiently [7]. Of course, we do not know which roadmap paths represent action sequences that are optimal. So we try many different paths, but bias towards those which likely result from an optimal policy. To do so, we order the outgoing edges at each node s in M by increasing edge weights and try the action sequences associated with the edges in this order. This helps us to explore different paths and avoid getting stuck in a local region.

The details of generating a new node b′ are slightly different, depending on whether the node b chosen in the first step is b0. If b = b0, conceptually we are starting a new path in M. So we sample a state s according to the belief b. We find the node s in M and choose the next outgoing edge (s, s′) that has not been tried yet. If all edges have been tried, we restart from the edge with the lowest weight. The state sequence associated with (s, s′) (see Section 3) is used to extract a valid observation sequence. We then apply the action sequence associated with (s, s′) and the observation sequence to b and generate a new belief b′, using (4) and (5). The resulting new belief b′ is inserted into TR as a child node of b and labeled with s′ so that the roadmap path can be continued from s′ later on. If the node b chosen in the first step is not b0, then b must have an associated state s. We find the node s in M, and the remaining steps are the same as those for the case b = b0.
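The density-weighted node selection described above can be written directly in terms of the d_K helper from Section 3: count, for each existing node, how many other nodes lie within a small radius, and sample inversely to that count. The radius value is a knob of our own; the paper only says the distance is a small pre-specified constant.

```python
import numpy as np

def pick_node(beliefs, labels, radius=0.1, rng=None):
    """Choose a tree node with probability proportional to 1 / w_c(b), where w_c(b) is the
    number of sampled beliefs within `radius` of b under the metric d_K of equation (3)."""
    rng = rng or np.random.default_rng()
    n = len(beliefs)
    w_c = np.ones(n)                    # each node counts itself
    for i in range(n):
        for j in range(i + 1, n):
            if d_K(beliefs[i], beliefs[j], labels) <= radius:
                w_c[i] += 1.0
                w_c[j] += 1.0
    inv = 1.0 / w_c
    return int(rng.choice(n, p=inv / inv.sum()))
```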

6 Experiments

We tested MiGS on four robotic tasks that require long planning horizons. Below, we first describe these tasks and examine the robot's behavior under policies computed by MiGS (Section 6.1). Next, we compare MiGS with other point-based POMDP solvers and alternative approaches to motion planning under uncertainty (Section 6.2). Finally, we look at the robustness of MiGS policies (Section 6.3).

6.1 Tasks and Computed Policies

2-D Navigation. A mobile robot navigates in a laboratory with obstacles and danger zones that must be avoided (Figure 2a). It starts in one of the two entrances to the lab, but does not know exactly which one. It must reach one of the goal positions. We represent the robot's position on a uniform grid of size 60 × 70. In each step, the robot moves from the current grid cell to one of the eight adjacent ones. Due to imperfect control, the robot reaches its intended cell only 90% of the time. For the rest of the time, the robot either remains in the current cell or drifts to the left or the right of the intended cell. The robot can localize itself only in regions with landmarks. Despite imperfect control and localization, the robot must reach the goal as quickly as possible while avoiding obstacles and danger zones.

A simulation run under a policy computed by MiGS is shown in Figure 2a. The robot starts at the upper entrance. It first moves all the way down to localize itself. It then moves back up and turns right into a corridor towards one of the goals. Since the robot does not know its starting position, it must move down to localize itself first, despite the additional distance traveled. Otherwise, it may get stranded in a danger zone.

Interestingly, although jagged paths are undesirable in general, some parts of the jagged path in Figure 2a are in fact deliberate and improve the robustness of the robot's motion in the presence of uncertainty. For example, Figure 2b shows the robot makes a zigzag motion first down-left and then down-right. Before the zigzag motion, there is some probability that the robot may enter the wrong room. So it moves down-left to shift this probability mass out. Now the probability mass is concentrated at the lower entrance. To avoid getting stuck there, the robot then moves down-right to shift the probability mass towards the center of the corridor.

As another example (Figure 2c), the robot's motion as it turns right into the corridor is similar in nature to that of wiggling a peg and inserting it into a hole. By moving up and down at the mouth of the corridor, the robot tries to position itself near the center of the corridor and pushes all the probability mass into it.

Figure 2: 2-D Navigation. (a) The robot's path in a simulation run. (b) The robot makes jagged motion to reduce the probability of entering the wrong room. The arrows point to the probability mass of interest. (c) The robot "wiggles" its way into the corridor. "S" marks the robot's possible initial positions. "G" marks the goal positions. "D" marks danger zones that must be avoided. Shaded discs mark landmark regions for robot localization. The blue line indicates the robot's path in a simulation run. The shaded blue region marks the robot's belief on its own position. Darker colors indicate higher probabilities.

Figure 3: Sampled milestones. (a) The initial roadmap. (b) The final roadmap. Each "×" marks a milestone.

Figure 4: 3-D navigation. See the caption of Figure 2 for the meanings of labels.

Figure 3 shows the sampled milestones in the roadmaps constructed by MiGS. In the initial roadmap, the milestones fall in some of the landmark regions and one goal region, as they provide either good observations or high rewards. In the final roadmap, the milestones cover all the landmark regions and goal regions. Some danger zones also contain milestones, because the knowledge of being in a danger zone provides information on the robot's position.

3-D Navigation. An unmanned aerial vehicle (UAV) navigates in an indoor environment (Figure 4), where GPS signal is not available. The environment is populated with obstacles and danger zones. The robot's state is represented as (x, y, z, θp, θy), where (x, y, z) is the robot's position, θp is the pitch angle, and θy is the yaw angle. The 3-D environment is discretized into a grid with 18 × 14 horizontal positions and 5 height levels. The pitch angle is discretized into 3 values, and the yaw angle, 8 values. The robot starts at the entrance to the environment, but does not know its initial state exactly. In each step, the robot can either rotate in place or move forward to one of the adjacent cells according to its heading. Rotation motion is quite accurate. However, when the robot moves forward, the robot reaches its intended state only 90% of the time; the rest of the time, it may remain in the same state or drift to the left or right of the intended state. The robot can localize itself near the landmarks.

The main difficulty in this task is similar to that of 2-D Navigation, but the state space is much larger because of the 3-D environment. A typical policy computed by MiGS takes the UAV through the rightmost route, which is longer but safer, because no danger zones are present and there are several landmarks along the way to aid localization.

Target Finding. A robot wants to find a moving target as quickly as possible in an environment containing obstacles and danger zones (Figure 5). The environment is represented as a uniform grid of size 29 × 49. The state space is parametrized by two variables (xr, xt), where xr is the robot's position and xt is the target's position. The target may lie in one of three regions. It can move freely within a region, but can not jump to a different region. The target's behavior is unknown to the robot. The robot knows its starting position only roughly. In each step, the robot can move to one of its eight adjacent cells. However, due to imperfect control, the robot fails to reach the intended cell 15% of the time. The robot can localize itself near a landmark. Due to sensor limitations, the robot can detect the target only when they are in the same grid cell.

Figure 5: Target finding. (a) The environment. Courtesy of the Radish data set. (b) The belief on the target's position, before the robot clears the bottom target region. (c) The belief, after the robot clears the bottom target region. (d) The belief, after the robot clears the top-left target region. The pink lines mark the boundaries of target regions. The shaded red regions mark the robot's belief on the target's position. Darker colors indicate higher probabilities. The red hollow triangle marks the target's actual position. See the caption of Figure 2 for the meanings of other labels.

The strategy generated by MiGS balances the effort of searching the target in a particular target region with the cost of moving between different target regions. The robot first searches the bottom target region, which is close to its starting position. It dispenses significant effort there until the probability of finding the target in this region becomes very small (Figure 5b). The robot then moves towards the upper target regions along the long, but safer path to the right, in order to avoid accidentally entering the danger zones. Once there, the robot first searches the upper left target region, because the prior probability of finding the target there is higher than that in the upper right target region (Figure 5b). However, the robot does not search the upper left target region as hard as the bottom target region. When it leaves the upper left region, there is still some probability that the target may be there (Figure 5c). The cost of moving between the upper left and right target regions is low. It is advantageous to search the upper right target region and then return to the upper left target region when necessary. Luckily, the robot finds the target in the upper right region.

Figure 6: Cooperative Navigation. (a) The 3-D view (top), the top view (bottom left), and the bottom view (bottom right) of the environment. (b) The ASC's path in a simulation run. (c) Although the ASC is already at the southernmost position, it still moves southeast in order to reduce the chance that the AUV gets blocked by obstacles. The arrow points to the obstacle that potentially blocks the AUV. (d) The ASC moves straight east, when the chance of having the AUV blocked by obstacles is sufficiently small. The blue line indicates the ASC's path. The shaded red region indicates the ASC's belief on the AUV's position. Darker colors indicate higher probabilities. The dashed circle marks the boundary of the acoustic modem's effective range with respect to the current ASC's position. See the caption of Figure 2 for the meanings of other labels.

Cooperative Navigation. Autonomous underwater vehicle (AUV) navigation in deep sea exploration missions is a major challenge, because it is difficult to control the motion of AUVs accurately and AUVs often rely on dead-reckoning to estimate their positions. One interesting approach is to use autonomous surface crafts (ASCs) to guide AUV navigation [4]. Here, we model a simplified version of this cooperative navigation task. We want to find a motion strategy for an ASC to guide an AUV navigating from a start position to a goal position, even though the ASC does not know the AUV's position exactly (Figure 6a).

The underwater environment where the AUV operates contains obstacles and danger zones. The surface environment where the ASC operates also contains obstacles. The obstacle layout and geometry in the underwater environment may differ from those in the surface environment. Both the surface and the underwater environment are represented as uniform grids of size 24 × 11.

The initial positions of the ASC and the AUV are not known exactly. In each step, the ASC moves to one of its eight adjacent cells. We assume that the ASC's control is sufficiently accurate and the error is negligible. In contrast, the AUV's control is quite noisy. In each step, it also moves to one of its eight adjacent cells, but reaches its intended cell only 60% of the time. For the rest of the time, it remains in its current cell or drifts to the left or the right of the intended cell.

To help the AUV navigate, the ASC is equipped with a GPS and an acoustic modem. The ASC transmits the GPS information and its own movement information continuously through the acoustic modem. When the AUV lies within a reliable transmission range of the ASC's acoustic modem, we assume that the AUV receives the GPS and movement information with perfect accuracy.

The AUV moves according to the following rules. When the AUV's estimated position is far from the boundary of the acoustic modem's range limit, the AUV replicates the ASC's action. When the AUV's estimated position is close to the acoustic modem's range limit, the AUV moves toward the ASC to ensure reliable signal reception. When the AUV lies beyond the reliable transmission range of the acoustic modem, we assume conservatively that the AUV receives no transmission and is lost. The AUV then moves in any direction with equal probability.

To help the ASC locate the AUV, the AUV uses an acoustic modem to transmit its own position. However, for efficiency, the AUV transmits only when it is able to localize itself well. The AUV localizes using GPS information from the ASC and from a few fixed transmission stations installed in the environment. Each fixed transmission station transmits its position continuously through an acoustic modem. Due to the acoustic modem's range limit, the AUV can only localize when it is sufficiently close to the ASC and the fixed transmission stations.

The policy generated by MiGS moves the ASC in such a way as to help the AUV reach the goal safely with high probability. To ensure the AUV's safety, the ASC deliberately takes a longer route (Figure 6b). The ASC also tries to ensure that the AUV is well within the reliable transmission range of the acoustic modem, based on its knowledge of the estimated AUV's position and the AUV's behavior. For example, Figure 6c shows that the ASC keeps moving in the southeast direction, though it knows perfectly well from the GPS readings that it is already at the southernmost position. It would appear that moving east and making progress along the path towards the goal should be more effective. Closer examination of the ASC's belief on the AUV's position reveals why the ASC does not do so. If the ASC moves east, the AUV will replicate the action, but with substantial probability, it will get blocked by the obstacle and remain in the same position (Figure 6c). Now if the ASC continues moving east, the AUV will eventually fall out of the acoustic modem's transmission range and get lost. As a result, the ASC moves in the southeast direction for a number of steps. It moves east only after the probability of having the AUV blocked by the obstacle is sufficiently small (Figure 6d).

6.2 Performance Comparison

We now compare MiGS quantitatively with other well-known approaches for motion planning and planning under uncertainty.

6.2.1 Experimental Setup

We ran three other planning algorithms: PRM [10], QMDP [25], and HSVI2 [22]. PRM is one of the most successful algorithms for motion planning without uncertainty. Results on PRM give an indication of the importance of uncertainty planning in the tasks described in Section 6.1. We did not run PRM on the Target Finding task, because the classic PRM algorithm applies only to problems with static goals, while the target position, which serves as the goal in this task, moves. QMDP is a simple and yet practical approximate POMDP solver that is well known in robotics. Finally, HSVI2 is one of the fastest point-based POMDP solvers available.

All the algorithms were implemented in C++. For HSVI2, we used ZMDP v1.1.6, the latest software package released by HSVI2's original author. All the experiments were performed on a computer server with a 2.66GHz Intel processor, 2GB memory, and the Linux operating system.

For each task, we estimated the success rate of policies computed by MiGS. The success rate is the percentage of simulation runs in which the robot accomplishes a given task successfully within a specified time limit. Since MiGS is a randomized algorithm, we ran MiGS 30 times and performed 100 simulation runs for each resulting policy. The average success rate was then reported.

The performance of PRM was assessed in a similar way. For each task, we ran PRM 30 times. In each run, we first sampled a start state according to the initial belief b0 and sampled a goal state uniformly at random from the set of possible goal states. We then ran PRM to generate a path from the start to the goal state. Finally, we treated the generated path as an open-loop policy and performed 100 simulation runs under the policy to estimate its average success rate.

QMDP and HSVI2 are both deterministic. We ran each algorithm until it either converged or reached a two-hour limit. We then performed 3,000 simulation runs to estimate the success rate of the resulting policy.

6.2.2 Results

Table 1: The success rates of policies computed by various algorithms.

                    PRM    QMDP   HSVI2   MiGS
2-D Navigation      0.0    0.0    0.0     93.0
3-D Navigation      14.4   0.1    24.0    97.4
Target Finding      —      11.7   12.2    95.0
Coop Navigation     0.0    0.0    0.0     99.0

Figure 7: The success rate of MiGS policies with increasing running times.

The results for the four test tasks show that MiGS significantly outperforms the other algorithms under comparison (Table 1 and Figure 7). PRM performs badly, as it ignores uncertainty completely. Its failure indicates that handling uncertainty is a key issue in these tasks. QMDP essentially performs one-step lookahead search in the belief space. This is insufficient for tasks that require long planning horizons. While HSVI2 performs multi-step lookahead, it relies heavily on its heuristic to guide the search in the belief space. The heuristic is constructed from an MDP solution, which assumes that the system state is fully observable in the current and all future time steps. As the planning horizon increases, this assumption starts to fail, and the heuristic becomes less effective in guiding the belief space search. In contrast, using the action sequences in the roadmap, MiGS looks ahead much further in a hierarchical fashion.

It is interesting to note that in 3-D Navigation, even though PRM ignores uncertainty, it performs better than QMDP, out of "luck". If the UAV moves from the starting state directly towards the goal, it very likely passes through a danger zone, resulting in a failure. However, it may reach the goal successfully occasionally. PRM ignores all these and computes such a direct route, which leads to an average success rate of 14%. QMDP performs limited lookahead and recognizes the risk of this route. However, its one-step lookahead is not long enough to generate an alternative policy that brings the UAV to the goal consistently. Thus, the UAV ends up getting stuck in the start state, resulting in complete failure in terms of success rate. By performing longer lookahead, HSVI2 performs better than QMDP, and MiGS performs much better than both.

In Target Finding, both QMDP and HSVI2 produce policies that find the target successfully whenever it is in the lower target region. The lower target region is close to the robot's start position, and the planning horizon required to generate such a policy is relatively short. However, to generate policies capable of finding the target in the two upper target regions, we need a much longer planning horizon. As a result, the policies produced by QMDP and HSVI2 after 2 hours of computation only succeed around 12% of the time (Table 1), essentially only when the target is located in the lower target region. In contrast, the policies produced by MiGS achieve 95% success rate after 10 minutes of computation (Figure 7).

6.3 Robustness of MiGS Policies

For many real-world robotic tasks, it is not easy to build accurate models of robot control and sensing. For instance, the motion of an AUV depends heavily on the water current, whose effect is difficult to model accurately [13]. Thus it is desirable to have POMDP policies that are robust against such modeling errors.

Figure 8: 2-D navigation in a simple geometric environment. The blue and red lines mark the robot's paths in simulation runs under π90,90 from the two different initial positions, respectively. See the caption of Figure 2 for the meanings of labels.

We performed preliminary tests on the robustness of policies computed by MiGS. We used a 2-D navigation task (Figure 8) similar to the one described in Section 6.1, but in a simpler geometric environment. The environment is represented as a uniform grid of size 20 × 42. In each step, the robot may move to one of eight adjacent cells and reaches the intended cell t% of the time. The robot can localize itself near landmarks, but it correctly identifies the landmark only z% of the time.

We applied MiGS to the POMDP model with t = 90 and z = 90. In other words, the model assumes quite accurate robot motion control and landmark identification. We simulated the resulting policy π_{90,90} while systematically varying the values of t and z during the simulation, in order to examine the robustness of π_{90,90} when t ≠ 90 or z ≠ 90. For comparison, we used MiGS to compute policies π_{t,z} for POMDP models with the correct values of t and z.
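The evaluation protocol amounts to a simple sweep of roughly the following form. This is a sketch under our own assumptions, with simulate and solve_migs as hypothetical helpers standing in for policy simulation and MiGS planning.

```python
def robustness_sweep(policy_90_90, solve_migs, simulate, t_values, z_values, n_runs=1000):
    """Compare the fixed policy pi_{90,90} against matched policies pi_{t,z}
    when the simulated world uses the perturbed parameters t and z."""
    results = []
    for t in t_values:
        for z in z_values:
            fixed = simulate(policy_90_90, t, z, n_runs)        # success rate of pi_{90,90}
            matched = simulate(solve_migs(t, z), t, z, n_runs)  # success rate of pi_{t,z}
            results.append((t, z, fixed, matched))
    return results
```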

With decreasing t value, the uncertainty in robot control increases, and it becomes more difficult for the robot to reach the goal reliably. A comparison of success rates for π_{90,90} and π_{t,z} suggests that small errors in the robot motion model do not have a significant effect here (Figure 9a). However, in relative terms, the performance of π_{90,90} gradually degrades as the model errors increase. When the model errors are large, the reachable belief space under the model with t = 90 and z = 90 could be substantially different from that under the correct model. Thus, during the simulation, the robot may encounter a belief not close to any considered during planning. It then has to resort to a default action, which often causes poor performance.

Figure 9: The effect of model errors on the performance of MiGS policies. (a) The success rate with increasing error in the robot motion model. (b) The success rate with increasing error in the sensing model.

Figure 9b shows that π_{90,90} is more tolerant of errors in the sensing model than of errors in the robot motion model. The reason is that the successful strategy commands the robot to move counterclockwise along the hallway and approach the goal from the right, thus avoiding the danger zones. Localization using the landmarks makes this strategy more efficient to execute, but does not seem to play a critical role. Thus sensing model errors do not affect the success rate significantly. In summary, these results suggest that MiGS policies are robust against small model errors, though more tests are needed to confirm this.

7 Conclusion

This paper proposes Milestone Guided Sampling (MiGS), a new point-based POMDP solver for robot motion planning tasks that require long time horizons. MiGS samples a set of milestones from a robot's state space and constructs a roadmap as a simplified representation of the state space.


Each edge of the roadmap represents an action sequence that brings the robot from one milestone to another. MiGS uses these action sequences in the roadmap, rather than single primitive actions, to sample the belief space, thus significantly reducing effective planning horizons while still capturing the important features of the belief space. We successfully tested MiGS in simulation on several difficult POMDPs modeling distinct robotic tasks in both 2-D and 3-D environments.
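The key mechanism, propagating a belief along a roadmap edge's entire action sequence rather than a single primitive action, can be sketched as follows. The belief update is the standard POMDP Bayes filter; the per-step observation sampling and all names are our own simplifications for illustration, not the exact MiGS procedure.

```python
import numpy as np

def propagate_belief(b, action, obs, T, O):
    """Standard Bayes filter update: b'(s') is proportional to
    O[action][s', obs] * sum_s T[action][s, s'] * b(s)."""
    pred = T[action].T @ b            # predicted next-state distribution
    new_b = O[action][:, obs] * pred  # weight by observation likelihood
    return new_b / new_b.sum()

def propagate_along_edge(b, action_sequence, T, O, sample_obs):
    """Follow a roadmap edge: apply its whole action sequence, sampling an
    observation at each step, to obtain a sampled belief several steps away."""
    for a in action_sequence:
        o = sample_obs(b, a)          # hypothetical observation sampler
        b = propagate_belief(b, a, o, T, O)
    return b
```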

Two main obstacles have hindered the application of POMDPs to realistic robotic tasks: large state spaces and long time horizons. The recently introduced point-based algorithms have shown impressive progress in solving POMDPs with large state spaces. However, their performance degrades significantly when the planning task has a long time horizon. By alleviating the difficulty of long horizons, we hope that our work brings POMDPs one step closer to becoming a practical tool for robot motion planning in uncertain and dynamic environments.

Acknowledgments. We thank Sylvie Ong and Shao Wei Png for reading the first draft of this paper and helping with scripting a POMDP model. We also thank the anonymous reviewers whose many suggestions helped to improve the presentation of the paper. This work is supported in part by AcRF grant R-252-000-327-112 from the Singapore Ministry of Education. H. Kurniawati thanks N. M. Patrikalakis for his support.

References

[1] R. Alterovitz, T. Simeon, and K. Goldberg. The stochastic motion roadmap: A sampling framework for planning with Markov motion uncertainty. In Proc. Robotics: Science and Systems, 2007.

[2] H. Choset, K.M. Lynch, S. Hutchinson, G. Kantor, W. Burgard, L.E. Kavraki, and S. Thrun. Principles of Robot Motion: Theory, Algorithms, and Implementations. The MIT Press, 2005.

[3] M. Erwig. The graph Voronoi diagram with applications. Networks, 36(3):156–163, 2000.

[4] M.F. Fallon, G. Papadopoulos, and J.J. Leonard. Cooperative AUV navigation using a single surface craft. In Proc. Field & Service Robotics, 2009.

[5] K. Hsiao, L.P. Kaelbling, and T. Lozano-Perez. Grasping POMDPs. In Proc. IEEE Int. Conf. on Robotics & Automation, pages 4685–4692, 2007.

[6] D. Hsu, J.C. Latombe, and H. Kurniawati. On the probabilistic foundations of probabilistic roadmap planning. Int. J. Robotics Research, 25(7):627–643, 2006.

[7] D. Hsu, W.S. Lee, and N. Rong. What makes some POMDP problems easy to approximate? In Advances in Neural Information Processing Systems (NIPS), 2007.

[8] D. Hsu, W.S. Lee, and N. Rong. A point-based POMDP planner for target tracking. In Proc. IEEE Int. Conf. on Robotics & Automation, pages 2644–2650, 2008.

[9] L. Kaelbling, M. Littman, and A. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.

[10] L.E. Kavraki, P. Svestka, J.-C. Latombe, and M.H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration space. IEEE Trans. on Robotics & Automation, 12(4):566–580, 1996.

[11] H. Kurniawati, D. Hsu, and W.S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proc. Robotics: Science and Systems, 2008.

[12] J.-C. Latombe. Robot Motion Planning. Kluwer Academic Publishers, 1991.

[13] B.H. Ooi, H. Zheng, H. Kurniawati, W. Cho, M.H. Dao, J. Wei, P. Zemskyy, P. Tkalich, P. Malanotte-Rizzoli, and N.M. Patrikalakis. Multi-vehicle oceanographic feature exploration. In Proc. Int. Offshore & Polar Engineering Conf., 2009.

[14] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Int. Joint Conf. on Artificial Intelligence, pages 1025–1032, August 2003.


[15] J. Pineau, G. Gordon, and S. Thrun. Policy-contingent abstraction for robust robot control. In Proc. Uncertainty in Artificial Intelligence, pages 477–484, 2003.

[16] J. Pineau, M. Montemerlo, M. Pollack, N. Roy, and S. Thrun. Towards robotic assistants in nursing homes: Challenges and results. Robotics and Autonomous Systems, 42(3–4):271–281, 2003.

[17] S. Prentice and N. Roy. The belief roadmap: Efficient planning in linear POMDPs by factoring the covariance. In Proc. Int. Symp. on Robotics Research, 2007.

[18] N. Roy, G. Gordon, and S. Thrun. Finding approximate POMDP solutions through belief compression. J. Artificial Intelligence Research, 23:1–40, 2005.

[19] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2002.

[20] G. Shani, R.I. Brafman, and S.E. Shimony. Forward search value iteration for POMDPs. In Int. Joint Conf. on Artificial Intelligence, pages 2619–2624, 2007.

[21] R.D. Smallwood and E.J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21:1071–1088, 1973.

[22] T. Smith and R. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Proc. Uncertainty in Artificial Intelligence, July 2005.

[23] M.T.J. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. J. Artificial Intelligence Research, 24:195–220, 2005.

[24] G. Theocharous and L.P. Kaelbling. Approximate planning in POMDPs with macro-actions. In Advances in Neural Information Processing Systems (NIPS), 2003.

[25] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. The MIT Press, 2005.

A Proof of Theorem 1

Proof. The optimal value function $V^*$ can be approximated arbitrarily closely by a piecewise-linear convex function and represented as $V^*(b) = \max_{\alpha \in \Gamma}(\alpha \cdot b)$ for a suitable set $\Gamma$ of $\alpha$-vectors. Let $\alpha$ and $\alpha'$ be the maximizing $\alpha$-vectors at $b$ and $b'$, respectively. Without loss of generality, assume $V^*(b) \ge V^*(b')$, so $V^*(b) - V^*(b') \ge 0$. Since $\alpha'$ is a maximizer at $b'$, we have $\alpha' \cdot b' \ge \alpha \cdot b'$ and $V^*(b) - V^*(b') = \alpha \cdot b - \alpha' \cdot b' \le \alpha \cdot b - \alpha \cdot b' = \alpha \cdot (b - b')$. It then follows that

\[ |V^*(b) - V^*(b')| \le |\alpha \cdot (b - b')|. \tag{7} \]

Next, we expand the inner product on the right-hand side of (7) over the partitioned state space:

\[ |V^*(b) - V^*(b')| \le \Big| \sum_{s \in S} \alpha(s)\big(b(s) - b'(s)\big) \Big| = \Big| \sum_{K \in \mathcal{K}} \sum_{s \in K} \alpha(s)\big(b(s) - b'(s)\big) \Big|, \]

since the subsets in $\mathcal{K}$ partition $S$. For any state $s_K$ in the subset $K \in \mathcal{K}$, we then have

\[ |V^*(b) - V^*(b')| \le \Big| \sum_{K \in \mathcal{K}} \sum_{s \in K} \big(\alpha(s) - \alpha(s_K) + \alpha(s_K)\big)\big(b(s) - b'(s)\big) \Big| \le \Big| \sum_{K \in \mathcal{K}} \sum_{s \in K} \big(\alpha(s) - \alpha(s_K)\big)\big(b(s) - b'(s)\big) \Big| + \Big| \sum_{K \in \mathcal{K}} \sum_{s \in K} \alpha(s_K)\big(b(s) - b'(s)\big) \Big|. \tag{8} \]

Let $e_1$ and $e_2$ denote the two terms in (8), respectively. We now bound $e_1$ and $e_2$ separately. By the definition of $\epsilon$-partitioning, $|\alpha(s) - \alpha(s_K)| \le \epsilon$ for all $s$ and $s_K$ in $K \in \mathcal{K}$. Thus,

\[ e_1 \le \sum_{K \in \mathcal{K}} \sum_{s \in K} \epsilon\,|b(s) - b'(s)| = \epsilon \sum_{s \in S} |b(s) - b'(s)| \le 2\epsilon, \tag{9} \]

where the last inequality holds because the $L_1$ distance between any two beliefs is no greater than 2. Let us now consider $e_2$. Since the absolute values of the $\alpha$-vector coefficients are no more than $R_{\max}/(1 - \gamma)$, it follows that

\[ e_2 \le \sum_{K \in \mathcal{K}} |\alpha(s_K)| \Big| \sum_{s \in K} \big(b(s) - b'(s)\big) \Big| \le \frac{R_{\max}}{1 - \gamma} \sum_{K \in \mathcal{K}} \Big| \sum_{s \in K} \big(b(s) - b'(s)\big) \Big|. \tag{10} \]

Using the condition $d_{\mathcal{K}}(b, b') \le \delta$, we get $e_2 \le \frac{R_{\max}}{1 - \gamma}\,\delta$. Combining this with (8) and (9) gives the desired result. $\Box$
