Nonmyopic Active Learning of Gaussian Processes: An Exploration–Exploitation Approach
Andreas Krause, Carlos Guestrin
Carnegie Mellon University
River monitoring
Want to monitor the ecological condition of a river. Need to decide where to make observations!
Mixing zone of San Joaquin and Merced rivers
[Plot: pH value along transect position (m)]
NIMS (UCLA)
Observation Selection for Spatial Prediction
Gaussian processes: a distribution over functions (e.g., how pH varies in space). Allows estimating uncertainty in the prediction.
[Plot: GP over pH along horizontal position, showing observations, the unobserved process, the prediction, and confidence bands]
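A GP posterior with confidence bands can be sketched in a few lines. This is a minimal single-observation example; the function names and the squared-exponential kernel with unit amplitude and bandwidth are illustrative assumptions, not the talk's code:

```python
import math

def k(s, t, amp=1.0, bw=1.0):
    # squared-exponential kernel (illustrative covariance choice)
    return amp * math.exp(-((s - t) ** 2) / bw ** 2)

def gp_predict(s, t_obs, x_obs, mean=0.0):
    # GP posterior mean/variance at location s after observing
    # x_obs at location t_obs (standard Gaussian conditioning)
    gain = k(s, t_obs) / k(t_obs, t_obs)
    mu = mean + gain * (x_obs - mean)
    var = k(s, s) - k(s, t_obs) ** 2 / k(t_obs, t_obs)
    return mu, var

# 95% confidence band at s: mu +/- 1.96 * sqrt(var)
mu, var = gp_predict(0.5, 0.0, 7.8, mean=7.6)
band = (mu - 1.96 * math.sqrt(var), mu + 1.96 * math.sqrt(var))
```

At the observed location the band collapses to the measurement; far away it reverts to the prior mean and variance.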
Mutual Information [Caselton & Zidek 1984]
Finite set of possible locations V. For any subset A ⊆ V, can compute

    MI(A) = H(V \ A) − H(V \ A | A)

i.e., the entropy of the uninstrumented locations before sensing minus their entropy after sensing.
Want: A* = argmax MI(A) subject to |A| ≤ k
Finding A* is an NP-hard optimization problem.
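For Gaussians both entropy terms have closed forms, so MI(A) reduces to a difference of log-variances (log-determinants in the multivariate case). A minimal scalar sketch, assuming a single uninstrumented location whose variance shrinks from `var_before` to `var_after` once A is sensed:

```python
import math

def gaussian_entropy(var):
    # differential entropy of a 1-D Gaussian with variance var
    return 0.5 * math.log(2 * math.pi * math.e * var)

def mi_gain(var_before, var_after):
    # MI(A) = H(V \ A) - H(V \ A | A) for one uninstrumented location
    return gaussian_entropy(var_before) - gaussian_entropy(var_after)

# a correlation rho shrinks the conditional variance by (1 - rho^2)
rho = 0.9
mi = mi_gain(1.0, 1.0 - rho ** 2)  # equals -0.5 * log(1 - rho^2)
```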
Want to find: A* = argmax_{|A|=k} MI(A)
Greedy algorithm:
  Start with A = ∅
  For i = 1 to k:
    s* := argmax_s MI(A ∪ {s})
    A := A ∪ {s*}
The greedy algorithm for finding optimal a priori sets.
Theorem [ICML 2005, with Carlos Guestrin, Ajit Singh]:

    MI(A_greedy) ≥ (1 − 1/e) max_{A: |A|=k} MI(A) − ε_MI

The result of the greedy algorithm is within a constant factor (~63%) of the optimal solution.
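The greedy loop above is easy to state against any set-scoring oracle; a sketch, where `score` stands in for the MI computation (treated here as a black box):

```python
def greedy_select(V, k, score):
    # greedy maximization of a set function score(A), e.g. score = MI
    A = set()
    for _ in range(k):
        # pick the element with the largest marginal benefit
        s_star = max((s for s in V if s not in A),
                     key=lambda s: score(A | {s}))
        A.add(s_star)
    return A
```

Note that the (1 − 1/e) guarantee relies on approximate monotone submodularity of MI; a plain set oracle does not check that.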
Sequential design
Observed variables depend on previous measurements and the observation policy π.
MI(π) = expected MI score over the outcomes of the observations
[Policy tree diagram: e.g., observe X5; if X5 < 20°C observe X3, else observe X7; further branches (< 18°C vs. ≥ 18°C) lead to X12, X23; each leaf scores its observed path, e.g., MI(X5=17, X3=16, X7=19) = 3.4; the policy's expected score here is MI(π) = 3.1]
A priori vs. sequential
Sets are very simple policies. Hence:

    max_A MI(A) ≤ max_π MI(π)   subject to |A| = |π| = k

Key question addressed in this work: how much better is sequential vs. a priori design?
Main motivation: performance guarantees for sequential design? A priori design is logistically much simpler!
GPs slightly more formally
Set of locations V; joint distribution P(X_V); for any A ⊆ V, P(X_A) is Gaussian.
GP defined by:
  Prior mean μ(s) [often constant, e.g., 0]
  Kernel K(s,t)
[Plot: pH value along transect position (m), with the locations V and variables X_V marked]
Example: squared-exponential kernel

    K(s,t) = θ₁ exp(−‖s − t‖₂² / θ₂²)

θ₁: variance (amplitude); θ₂: bandwidth
[Plot: correlation vs. distance for the squared-exponential kernel]
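A direct transcription of this kernel; the parameter names `theta1`/`theta2` mirror the slide's θ₁, θ₂:

```python
import math

def se_kernel(s, t, theta1=1.0, theta2=1.0):
    # K(s,t) = theta1 * exp(-|s - t|^2 / theta2^2)
    # theta1: variance (amplitude), theta2: bandwidth
    return theta1 * math.exp(-((s - t) ** 2) / theta2 ** 2)
```

At distance 0 the kernel returns the variance θ₁; a larger bandwidth θ₂ keeps distant points more correlated.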
Known parameters (bandwidth, variance, etc.)
No benefit in sequential design!

    max_A MI(A) = max_π MI(π)

Mutual information does not depend on the observed values:

    H(X_B | X_A = x_A) ∝ log |Σ_{B|A}(θ)|
Unknown (discretized) parameters: prior P(Θ = θ)
Mutual information does depend on the observed values!

    P(x_B | x_A) = Σ_θ P(θ | x_A) · N(x_B; μ_{B|A}(θ), Σ_{B|A}(θ))

Sequential design can be better!

    max_A MI(A) ≤ max_π MI(π)

The policy π depends on the observations!
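The parameter mixture above implies a simple Bayes update over the discretized grid; a sketch, where the two candidate variances and the zero prior mean are hypothetical values chosen for illustration:

```python
import math

def normal_pdf(x, mu, var):
    # density of N(mu, var) at x
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def update_posterior(prior, likelihood):
    # P(theta | x_A) proportional to P(theta) * P(x_A | theta), over a grid
    raw = {th: prior[th] * likelihood[th] for th in prior}
    z = sum(raw.values())
    return {th: r / z for th, r in raw.items()}

# observing a value near the mean favors the small-variance hypothesis
x = 0.1
prior = {0.5: 0.5, 2.0: 0.5}          # two candidate variances, uniform prior
like = {th: normal_pdf(x, 0.0, th) for th in prior}
post = update_posterior(prior, like)
```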
Key result: how big is the gap?
If θ known: MI(A*) = MI(π*). If θ "almost" known: MI(A*) ≈ MI(π*).

Theorem:

    MI(π*) ≤ MI(A*) + O(1) · H(Θ)

(MI of the best policy ≤ MI of the best set plus a gap that depends on the parameter entropy H(Θ).)

Moreover:

    MI(π*) ≤ Σ_θ P(θ) max_{|A|=k} MI(A | θ) + H(Θ)

As H(Θ) → 0, the MI of the best policy approaches the MI of the best parameter-specific set.
Near-optimal policy if parameters approximately known
Use the greedy algorithm to optimize

    MI(A_greedy | Θ) = Σ_θ P(θ) MI(A_greedy | θ)

Note: |MI(A | Θ) − MI(A)| ≤ H(Θ). Can compute MI(A | Θ) analytically, but not MI(A).

Corollary [using our result from ICML 05]:

    MI(A_greedy) ≥ (1 − 1/e) MI(π*) − ε − O(1) · H(Θ)

The greedy result is within ~63% of the optimal sequential plan; the gap is ≈ 0 for known parameters.
Exploration—Exploitation for GPs

|  | Reinforcement learning | Active learning in GPs |
| --- | --- | --- |
| Parameters | P(S_{t+1} \| S_t, A_t), Rew(S_t) | Kernel parameters θ |
| Known parameters: exploitation | Find near-optimal policy by solving the MDP! | Find near-optimal policy by finding the best set |
| Unknown parameters: exploration | Try to quickly learn parameters! Need to waste only polynomially many robots! | Try to quickly learn parameters. How many samples do we need? |
Parameter info-gain exploration (IGE)
Gap depends on H(Θ). Intuitive heuristic: greedily select

    s* = argmax_s I(Θ; X_s) = argmax_s [H(Θ) − H(Θ | X_s)]

i.e., the parameter entropy before observing s minus the expected parameter entropy after observing s.
Does not directly try to improve spatial prediction; no sample complexity bounds.
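The IGE score I(Θ; X_s) can be computed exactly once Θ and the observation are discretized. A sketch with a binary parameter and a binary test outcome; the likelihood tables are hypothetical:

```python
import math

def entropy(p):
    # Shannon entropy of a discrete distribution (list of probabilities)
    return -sum(q * math.log(q) for q in p if q > 0)

def info_gain(prior, like):
    # I(Theta; X_s) = H(Theta) - E_x[ H(Theta | X_s = x) ]
    # like[x][i] = P(X_s = x | theta_i) for discrete outcomes x
    expected_post_entropy = 0.0
    for lx in like:
        px = sum(p * l for p, l in zip(prior, lx))
        post = [p * l / px for p, l in zip(prior, lx)]
        expected_post_entropy += px * entropy(post)
    return entropy(prior) - expected_post_entropy
```

An observation whose distribution is identical under every θ yields zero gain; the more the hypotheses disagree about X_s, the larger the gain.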
Implicit exploration (IE)
Intuition: any observation will help us reduce H(Θ). Sequential greedy algorithm: given previous observations X_A = x_A, greedily select

    s* = argmax_s MI({X_s} | X_A = x_A, Θ)

Contrary to the a priori greedy algorithm, this algorithm takes observations into account (updates the parameter posterior).
Proposition: H(Θ | X_π) ≤ H(Θ), i.e., "information never hurts" holds for policies.
No sample complexity bounds.
Learning the bandwidth
Sensors within the kernel bandwidth are correlated; sensors outside the bandwidth are ≈ independent.
Can narrow down the kernel bandwidth by sensing inside and outside the bandwidth distance!
[Diagram: kernel with bandwidth marked; sensor locations A, B, C at varying distances]
Hypothesis testing: distinguishing two bandwidths
Squared-exponential kernel:

    K(s,t) = exp(−‖s − t‖₂² / θ²)

Choose pairs of samples at the distance where the correlation gap between the two hypotheses (e.g., BW = 1 vs. BW = 3) is largest, and test the correlation!
[Plots: correlation vs. distance under BW = 1 and BW = 3, with the distance of largest correlation gap marked; sample GP realizations for BW = 1 and BW = 3]
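Picking the test distance is a one-dimensional search for the largest correlation gap between the two hypotheses; a grid-search sketch (the grid resolution is arbitrary):

```python
import math

def corr(d, bw):
    # correlation at distance d under the SE kernel with bandwidth bw
    return math.exp(-d ** 2 / bw ** 2)

def best_test_distance(bw1, bw2, grid):
    # distance at which |corr under bw1 - corr under bw2| is largest
    return max(grid, key=lambda d: abs(corr(d, bw1) - corr(d, bw2)))

grid = [0.05 * i for i in range(1, 200)]
d_star = best_test_distance(1.0, 3.0, grid)
```

For BW = 1 vs. BW = 3 the gap e^(−d²/9) − e^(−d²) peaks analytically at d = sqrt((9/8) ln 9) ≈ 1.57, which the grid search recovers.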
Hypothesis testing: sample complexity
Theorem: to distinguish bandwidths with minimum gap ρ in correlation and error < ε, we need

    N̂ = O((1/ρ²) log²(1/ε))

independent samples.
In GPs, samples are dependent, but "almost" independent samples suffice! (Details in paper.)
Other tests can be used for variance, noise, etc. What if we want to distinguish more than two bandwidths?
Hypothesis testing: binary searching for the bandwidth
[Plot: posterior P(θ) over discretized bandwidths 1–5]
Find the "most informative split" at the posterior median.
The testing policy π_ITE needs only logarithmically many tests!
Theorem: if we have tests with error < ε_T, then

    E_T[MI(π_ITE + A_greedy | Θ)] ≥ (1 − 1/e) MI(π*) − k·ε_MI − O(ε_T)
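The "most informative split" is just a cut of the sorted hypothesis grid at the posterior median; a sketch:

```python
def median_split(thetas, probs):
    # cut a sorted bandwidth grid at the posterior median:
    # each test then rules out roughly half of the posterior mass,
    # which is why only logarithmically many tests are needed
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if cum >= 0.5:
            return thetas[:i + 1], thetas[i + 1:]
```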
Exploration—Exploitation Algorithm
Exploration phase:
  Sample according to the exploration policy
  Compute a bound on the gap between the best set and the best policy
  If the bound < the specified threshold, go to the exploitation phase; otherwise continue exploring.
Exploitation phase:
  Use the a priori greedy algorithm to select the remaining samples
For hypothesis testing, guaranteed to proceed to exploitation after logarithmically many samples!
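The two phases can be sketched as one loop. The callables `explore_step`, `gap_bound`, and `greedy_exploit` are hypothetical stand-ins for the exploration policy, the set-vs-policy gap bound, and the a priori greedy algorithm:

```python
def explore_exploit(explore_step, gap_bound, greedy_exploit, budget, threshold):
    # exploration phase: sample until the bound on the gap between the
    # best set and the best policy drops below the threshold
    samples = []
    while len(samples) < budget and gap_bound(samples) >= threshold:
        samples.append(explore_step(samples))
    # exploitation phase: spend the remaining budget a priori greedily
    samples.extend(greedy_exploit(budget - len(samples), samples))
    return samples
```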
Results
None of the strategies dominates the others; usefulness depends on the application.
Temperature data. IGE: parameter info-gain; ITE: hypothesis testing; IE: implicit exploration.
[Plots: RMS error vs. number of observations, and parameter uncertainty vs. number of observations, for IE, ITE, and IGE; floor plan of the instrumented lab with 54 sensor locations]
Nonstationarity by spatial partitioning
Isotropic GP for each region, weighted by region membership (a spatially varying linear combination).
[Plots: stationary fit vs. nonstationary fit along transect coordinates (m)]
Problem: the parameter space grows exponentially in the number of regions!
Solution: a variational approximation (BK-style) allows efficient approximate inference. (Details in paper.)
Results on river data
Nonstationary model + active learning lead to lower RMS error.
[Plots: RMS error vs. number of observations for IE nonstationary, IE isotropic, and a priori nonstationary; sample placements along transect coordinates (m), larger bars = later sample]
Results on temperature data
IE reduces error most quickly; IGE reduces parameter entropy most quickly.
[Plots: RMS error vs. number of observations for IE isotropic, IGE nonstationary, IE nonstationary, and random nonstationary; parameter uncertainty vs. number of observations for IE nonstationary and IGE nonstationary]
Conclusions
Nonmyopic approach towards active learning in GPs.
If parameters are known, the greedy algorithm achieves near-optimal exploitation.
If parameters are unknown, perform exploration:
  Implicit exploration
  Explicit exploration using information gain
  Explicit exploration using hypothesis tests, with logarithmic sample complexity bounds!
Each exploration strategy has its own advantages.
Can use the bound to compute a stopping criterion.
Presented extensive evaluation on real-world data.