Surveillance in an Abruptly Changing World via Multi-armed Bandits
Vaibhav Srivastava
Department of Mechanical & Aerospace Engineering
Princeton University
December 15, 2014
Joint work with: Paul Reverdy and Naomi Leonard
IEEE Conference on Decision and Control, Los Angeles, CA
Vaibhav Srivastava (Princeton University) Surveillance via Multi-armed Bandits December 15, 2014 1 / 12
Surveillance and Environmental Monitoring
Picture Credit: http://www.kevindemarco.com
Underwater Search
Underwater Robotic Testbed at Princeton University
1. Repetitive search of an object of interest, e.g., a certain type of algae
2. Events of interest may arrive according to some process
3. Noisy measurements and exploration-exploitation trade-off
4. Environmental features are correlated
5. Travel may be costly
Incomplete Literature Review
Environmental Monitoring and Surveillance
A. Singh, A. Krause, C. Guestrin, and W. J. Kaiser. Efficient informative sensing using multiple robots. Journal of Artificial Intelligence Research, 34(2):707–755, 2009
G. A. Hollinger et al. Underwater data collection using robotic sensor networks. IEEE Journal on Selected Areas in Communications, 30(5):899–911, 2012
N. E. Leonard, D. A. Paley, F. Lekien, R. Sepulchre, D. M. Fratantoni, and R. E. Davis. Collective motion, sensor networks, and ocean sampling. Proc of the IEEE, 95(1):48–74, 2007
N. Sydney and D. A. Paley. Multivehicle coverage control for a nonstationary spatiotemporal field. Automatica, 50(5):1381–1390, 2014
R. Graham and J. Cortes. Adaptive information collection by robotic sensor networks for spatial estimation. IEEE Trans on Automatic Control, 57(6):1404–1419, 2012

Multi-armed Bandit Problems
T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002
E. Kaufmann, O. Cappe, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In ICAIS, pages 592–600, April 2012
A. Garivier and E. Moulines. On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415, 2008
Stochastic Multi-armed Bandits
N options with unknown mean rewards m_i
the obtained reward is corrupted by noise
the distribution of the noise is known: \mathcal{N}(0, \sigma_s^2)
can play only one option at a time

Picture Credit: Microsoft Research

Objective: maximize the expected cumulative reward until time T
Equivalently: minimize the cumulative regret

\text{Cum. Regret} = \sum_{t=1}^{T} \big(m_{\max} - m_{i_t}\big),

where m_{\max} = \max_i m_i and i_t is the arm picked at time t.

Prototypical example of the exploration-exploitation trade-off
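The regret objective can be sketched numerically. A minimal sketch, assuming an illustrative instance (the five mean rewards and the uniformly random baseline policy are not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: N = 5 options with unknown mean rewards m_i;
# each pull of option i returns m_i corrupted by Gaussian noise N(0, sigma_s^2).
m = np.array([1.0, 2.0, 3.0, 2.5, 1.5])
sigma_s = 1.0
T = 1000

def cumulative_regret(picks, m):
    """Cum. Regret = sum over t = 1..T of (m_max - m_{i_t})."""
    return float(np.sum(m.max() - m[picks]))

# A naive uniformly random policy, for reference: its regret grows linearly in T.
picks = rng.integers(0, len(m), size=T)
rewards = m[picks] + sigma_s * rng.standard_normal(T)  # noisy observations
print(cumulative_regret(picks, m))
```

Note that the regret is defined through the true means, not the noisy rewards, so it measures the quality of the arm choices alone.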
Spatially Embedded Gaussian Multi-armed Bandits
reward at option i: \sim \mathcal{N}(m_i, \sigma_s^2)
prior on rewards: m \sim \mathcal{N}(\mu^0, \Sigma^0)
spatial structure captured through \Sigma^0, e.g., \sigma^0_{ij} = \sigma_0 \exp(-d_{ij}/\lambda)

value of option i at time t:
Q_i^t = \big(1 - \tfrac{1}{Kt}\big)\text{-upper credible limit} = \underbrace{\mu_i^t}_{\text{exploit}} + \underbrace{\sigma_i^t \, \Phi^{-1}\big(1 - \tfrac{1}{Kt}\big)}_{\text{explore}}

Inference Algorithm:
\Lambda^t \mu^t = r_t \phi_t / \sigma_s^2 + \Lambda^{t-1} \mu^{t-1}, \quad \Lambda^t = \phi_t \phi_t^\top / \sigma_s^2 + \Lambda^{t-1}, \quad \Sigma^t = (\Lambda^t)^{-1}

Upper Credible Limit (UCL) Algorithm:
- pick the option with the maximum value at each time
E. Kaufmann, O. Cappe, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In ICAIS, pages 592–600, April 2012
P. Reverdy, V. S., and N. E. Leonard. Modeling human decision making in generalized Gaussian multiarmed bandits. Proc of the IEEE, 102(4):544–571, 2014
V. S., P. Reverdy, and N. E. Leonard. Correlated and dynamic multiarmed bandit problems: Bayesian algorithms and regret analysis. 2014. In prep
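The inference recursion and the UCL rule can be sketched together. In this sketch, phi_t is taken as the indicator vector of the chosen arm, the default K is an assumption borrowed from the UCL literature, and the standard-library NormalDist supplies Phi^{-1}; the instance passed in is hypothetical:

```python
import numpy as np
from statistics import NormalDist

def ucl(m_true, mu0, Sigma0, sigma_s, T, K=np.sqrt(2 * np.pi * np.e), seed=0):
    """Sketch of the UCL algorithm: Bayesian update of (mu, Sigma) via the
    precision recursion and greedy selection of the (1 - 1/(K t))-upper
    credible limit.  m_true, mu0: length-N arrays; Sigma0: N x N prior cov."""
    rng = np.random.default_rng(seed)
    N = len(m_true)
    Lam = np.linalg.inv(Sigma0)   # precision matrix Lambda^0
    q = Lam @ mu0                 # running right-hand side Lambda^t mu^t
    picks = []
    for t in range(1, T + 1):
        Sigma = np.linalg.inv(Lam)
        mu = Sigma @ q
        sig = np.sqrt(np.diag(Sigma))
        z = NormalDist().inv_cdf(1 - 1 / (K * t))   # Phi^{-1}(1 - 1/(K t))
        i = int(np.argmax(mu + sig * z))            # exploit + explore
        r = m_true[i] + sigma_s * rng.standard_normal()
        phi = np.zeros(N); phi[i] = 1.0             # indicator of the chosen arm
        Lam += np.outer(phi, phi) / sigma_s**2      # Lambda^t update
        q += r * phi / sigma_s**2                   # Lambda^t mu^t update
        picks.append(i)
    return np.array(picks)
```

With a correlated Sigma0, sampling one option also tightens the credible intervals of nearby options, which is what makes the spatial embedding pay off.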
Gaussian Multi-armed Bandits with Abrupt Changes
Sliding-Window Approach: Description
the mean rewards switch to unknown values at unknown times
the switched rewards may have the same correlation scale
the number of switches until time T is upper bounded by \zeta_T

Sliding-Window UCL Algorithm
estimate the mean using observations at times \{(t - t_w)^+ + 1, \ldots, t\};
select the arm i with the maximum value of

Q_i^{t, t_w} := \mu_i^{t, t_w} + \sigma_i^{t, t_w} \, \Phi^{-1}\Big(1 - \frac{1}{K \min\{t_w, t\}}\Big),

an adaptation of the frequentist algorithm by Garivier and Moulines
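One decision step of the rule above can be sketched in code. This is a minimal sketch assuming an independent Gaussian prior per arm for brevity (the talk uses the correlated prior); the function name and the history format are hypothetical:

```python
import numpy as np
from statistics import NormalDist

def sw_ucl_pick(history, N, t, t_w, sigma_s, mu0=0.0, sigma0=10.0, K=4.0):
    """Sliding-window UCL decision at time t: estimate each arm's mean from
    the rewards observed in the window {(t - t_w)^+ + 1, ..., t} and pick the
    arm maximizing mu_i + sigma_i * Phi^{-1}(1 - 1/(K min{t_w, t})).
    history: list of (time, arm, reward) tuples."""
    window = [(a, r) for (s, a, r) in history if s > max(t - t_w, 0)]
    z = NormalDist().inv_cdf(1 - 1 / (K * min(t_w, t)))
    Q = np.empty(N)
    for i in range(N):
        rs = [r for (a, r) in window if a == i]
        # conjugate Gaussian update with independent prior N(mu0, sigma0^2)
        prec = 1 / sigma0**2 + len(rs) / sigma_s**2
        mu = (mu0 / sigma0**2 + sum(rs) / sigma_s**2) / prec
        Q[i] = mu + z * np.sqrt(1 / prec)
    return int(np.argmax(Q))
```

Discarding observations older than t_w is what lets the estimate track abrupt changes: stale rewards from before a switch simply fall out of the window.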
Gaussian Multi-armed Bandits with Abrupt Changes
Sliding-Window Approach: Analysis

Sliding-Window UCL Algorithm
estimate the mean using observations at times \{(t - t_w)^+ + 1, \ldots, t\};
select the arm i with the maximum value of

Q_i^{t, t_w} := \mu_i^{t, t_w} + \sigma_i^{t, t_w} \, \Phi^{-1}\Big(1 - \frac{1}{K \min\{t_w, t\}}\Big)

Analysis of the Sliding-Window UCL Algorithm
for \zeta_T = O(T^{\nu}), \nu \in [0, 1), and t_w = \big\lceil \sqrt{T \log T / \zeta_T} \big\rceil:
\mathbb{E}[n_i^T] \leq O\big(T^{(1+\nu)/2} \sqrt{\log T}\big);
for \zeta_T = \epsilon T, for some \epsilon \in [0, 1), and t_w = \big\lceil \sqrt{-\log \epsilon / \epsilon} \big\rceil:
\mathbb{E}[n_i^T] \leq O\big(T \sqrt{-\epsilon \log \epsilon}\big).
Gaussian Multi-armed Bandits with Abrupt Changes
Block Allocation Strategy: Description of Blocks

Block allocation to reduce travel cost
divide the sampling times into frames \{1, \ldots, L+1\}; the L-th frame ends at 2^{k_w}, where k_w is the equivalent of the width of the time window
the k-th frame, k \in \{1, \ldots, L\}, is subdivided into blocks of length k
the (L+1)-th frame contains the times \{2^{k_w} + 1, \ldots, T\} and is subdivided into blocks of length k_w

[Figure: frame structure. Frames f_k end at times 2^0, 2^1, 2^2, \ldots, 2^{k_w}; within frame f_k, block r starts at \tau_{k(r-1)} and has length k; the last frame runs up to T in blocks of length k_w.]
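The frame/block partition of the time axis can be sketched as a helper. This is a sketch under an assumed reading of the slide (time 1 treated as its own block, frame k covering the times after 2^(k-1) up to 2^k):

```python
def frames(T, k_w):
    """Partition times 1..T into blocks: frame k (k = 1..k_w) covers times
    {2^(k-1)+1, ..., 2^k} cut into blocks of length k; the last frame covers
    {2^k_w + 1, ..., T} in blocks of length k_w.  Returns a list of blocks,
    each a list of consecutive times."""
    blocks = [[1]]                              # assumed initial singleton block
    for k in range(1, k_w + 1):
        times = list(range(2**(k - 1) + 1, 2**k + 1))
        for s in range(0, len(times), k):
            blocks.append(times[s:s + k])       # block r of frame f_k
    tail = list(range(2**k_w + 1, T + 1))
    for s in range(0, len(tail), k_w):
        blocks.append(tail[s:s + k_w])          # last frame: blocks of length k_w
    return blocks
```

The block lengths grow with the frame index, so early on the policy can switch arms often (cheap exploration), while later the arm is held longer between moves.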
Gaussian Multi-armed Bandits with Abrupt Changes
Block Allocation Strategy: Description of the Algorithm

Block Sliding-Window UCL Algorithm
At the beginning of the r-th block in the k-th frame, i.e., at time \tau_{kr}, the algorithm
performs the estimation using the observations collected in the time window \{(\tau_{kr} - 2^{k_w})^+ + 1, \ldots, \tau_{kr}\};
selects the arm i with the maximum value of

Q_i^{\tau_{kr}, k_w} := \mu_i^{\tau_{kr}, k_w} + \sigma_i^{\tau_{kr}, k_w} \, \Phi^{-1}\Big(1 - \frac{1}{K \min\{2^{k_w}, \tau_{kr}\}}\Big),

for the duration of the block

Block SW-UCL achieves the same order of performance as SW-UCL
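The travel-saving mechanism, one decision per block held for the block's duration, can be sketched with hypothetical helper names:

```python
def block_schedule(block_picks, blocks):
    """Expand one arm choice per block into a per-time arm sequence: the arm
    selected at the block's start time tau_kr is played for the whole block,
    so the vehicle relocates at most once per block."""
    seq = []
    for arm, block in zip(block_picks, blocks):
        seq.extend([arm] * len(block))
    return seq

def num_transitions(seq):
    """Number of arm-to-arm moves (travel events) along the sequence."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)
```

Since transitions can occur only at block boundaries, the number of travel events is bounded by the number of blocks rather than by T.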
Numerical Illustration
Environment: 5 \times 5 square grid
Reward at the optimal arm: m^* = 10
Reward at the other arms: m_j = m_i \exp(-0.3\, d_{ij}), with i the optimal arm and d_{ij} the distance between arms
Assumed correlation scale: \rho_{ij} = \exp(-0.3\, d_{ij})
\sigma_s^2 = 1 and \sigma_0^2 = 10
Number of changes: \zeta_T = \lfloor \sqrt{T} \rfloor
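The setup above can be reconstructed in a few lines; the grid indexing and the choice of the grid center as the optimal arm are assumptions for illustration:

```python
import numpy as np

n = 5
xy = np.array([(i, j) for i in range(n) for j in range(n)], dtype=float)
d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)  # pairwise distances

i_star = 12                                # assumed optimal arm: center of the grid
m = 10.0 * np.exp(-0.3 * d[i_star])        # m* = 10, decaying with distance
rho = np.exp(-0.3 * d)                     # assumed correlation: rho_ij = exp(-0.3 d_ij)
sigma_s2, sigma0_2 = 1.0, 10.0             # noise variance and prior variance
Sigma0 = sigma0_2 * rho                    # prior covariance over the grid
T = 10**4
zeta_T = int(np.floor(np.sqrt(T)))         # number of changes: floor(sqrt(T))
```

This exponentially decaying reward surface makes nearby arms informative about each other, which is exactly the structure the correlated prior Sigma0 encodes.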
[Figure: expected number of selections of suboptimal arms and expected number of transitions among arms, each versus horizon length from 10^2 to 10^5. Black line: SW-UCL; red line: Adaptive SW-UCL; green line: Block SW-UCL.]
How Important Is the Correlation Scale?
Beluga
Underwater Testbed
Virtual reward surface
Experiment and Video Credit: Peter Langdren
Conclusions and Future Directions
Conclusions
1. A multi-armed bandit framework for surveillance problems
2. Arrival of events of interest ⟹ abrupt changes in the reward surface
3. Exploration-exploitation trade-off and the role of the correlation scale
4. Block allocation to reduce travel cost

Future Directions
1. Extension to multiple vehicles
2. Environmental partitioning strategies catered to addressing the exploration-exploitation trade-off
3. Extensions to continuously changing environments