![Page 1: Maximizing Information Gain via Prediction Reward · This paper • Main theoretical result • How can we design state-based reward functions that approximate information gain. •](https://reader036.vdocuments.us/reader036/viewer/2022070814/5f0e11777e708231d43d7588/html5/thumbnails/1.jpg)
Maximizing Information Gain via Prediction Reward
Yash Satsangi, Sungsu Lim, Shimon Whiteson, Frans Oliehoek, Martha White
Active perception
The ability to take actions to reduce uncertainty
• Sensor selection
• Visual attention
Active perception as an RL task
Reward: ρ(b) = −H(b)
Explicit belief inference?
This paper
• Main theoretical result
  • How can we design state-based reward functions that approximate information gain?
  • Prediction rewards are a linear approximation to entropy.
• Deep Anticipatory Networks (DAN)
  • A deep RL algorithm that trains two deep neural networks simultaneously on each other’s feedback.
  • Useful when the reward is a convex function of the agent's belief.
• Experiments
  • Sensor selection with DAN
  • Visual attention with DAN
Prediction reward
A connection between prediction reward and information gain
Prediction reward: reward the agent for making accurate predictions.
• r′ for a correct prediction
• r″ otherwise
Expected prediction reward: the expectation of the prediction reward under the belief b.
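As a rough illustration of how an expected prediction reward relates to negative entropy, the Python sketch below evaluates both for a discrete belief. It assumes the agent always predicts its most likely state. The function names and the default values `r_correct=1.0` and `r_wrong=0.0` are illustrative choices, not taken from the slides.

```python
import math

def neg_entropy(belief):
    """Negative Shannon entropy (in nats) of a discrete belief."""
    return sum(p * math.log(p) for p in belief if p > 0)

def expected_prediction_reward(belief, r_correct=1.0, r_wrong=0.0):
    """Expected reward when predicting the most likely state (an assumption)."""
    p_hit = max(belief)  # probability that the most likely state is the true one
    return p_hit * r_correct + (1.0 - p_hit) * r_wrong

# Compare the two quantities on a uniform, a peaked, and a certain belief.
for belief in ([0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1], [1.0, 0.0, 0.0, 0.0]):
    print(belief, round(neg_entropy(belief), 3), expected_prediction_reward(belief))
```

For these three beliefs, both quantities increase as the belief sharpens, which is the qualitative behavior a reward for reducing uncertainty should have.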
Main result
A connection between prediction reward and information gain
[Figure: ρ(b) = −H(b) and ρ′(b), with labels: expected prediction reward, constant term, ε₁, ε₂.]
Main consequences
A connection between prediction reward and information gain
Can estimate using samples
Question answering
Visual attention
Intrinsic motivation
This paper
Active sensing
Active perception
Sensor selection
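One way to read the point about estimating from samples: an expected prediction reward is an average of per-episode rewards, so it can be approximated by Monte Carlo sampling without maintaining a belief. The Python sketch below is a toy under stated assumptions: a hypothetical sensor that reports the hidden state with probability `accuracy`, a predictor that simply trusts the observation, and a reward of 1 for a correct prediction and 0 otherwise. None of these specifics come from the slides.

```python
import random

def estimate_prediction_reward(n_samples=20000, accuracy=0.8, n_states=4, seed=0):
    """Monte Carlo estimate of the expected prediction reward (1 if correct, else 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        state = rng.randrange(n_states)  # hidden state, used only to score the prediction
        if rng.random() < accuracy:
            obs = state  # informative reading
        else:
            # wrong reading, drawn uniformly from the other states
            obs = rng.choice([s for s in range(n_states) if s != state])
        prediction = obs  # toy predictor: trust the observation
        total += 1.0 if prediction == state else 0.0
    return total / n_samples

print(estimate_prediction_reward())
```

In this toy model the estimate should land close to `accuracy`, since the predictor is correct exactly when the sensor reading is.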
DAN: Deep Anticipatory Networks
Train the Q and M networks simultaneously
The Q agent is rewarded if the M agent predicts the unknown variable correctly.
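The interplay between the two learners can be sketched with tabular stand-ins for the deep networks. This is a minimal toy, not the DAN implementation: the "Q" learner is an epsilon-greedy bandit over two hypothetical sensors, the "M" learner is a count-based predictor, sensor 1 reveals the hidden state, and sensor 0 returns noise. All names and parameters are assumptions made for illustration.

```python
import random

def train_toy_dan(steps=2000, eps=0.2, seed=0):
    """Jointly train a tabular 'Q' (sensor chooser) and 'M' (state predictor)."""
    rng = random.Random(seed)
    n_states, n_sensors = 4, 2
    q = [0.0] * n_sensors   # Q: running mean of prediction reward per sensor
    pulls = [0] * n_sensors
    # M: counts[a][o][s] = how often state s co-occurred with observation o from sensor a
    counts = [[[0] * n_states for _ in range(n_states)] for _ in range(n_sensors)]
    for _ in range(steps):
        s = rng.randrange(n_states)  # hidden variable to be predicted
        if rng.random() < eps:
            a = rng.randrange(n_sensors)                       # explore
        else:
            a = max(range(n_sensors), key=lambda k: q[k])      # exploit
        o = s if a == 1 else rng.randrange(n_states)           # sensor 0 is pure noise
        guess = max(range(n_states), key=lambda x: counts[a][o][x])  # M's prediction
        reward = 1.0 if guess == s else 0.0  # Q is rewarded when M is correct
        pulls[a] += 1
        q[a] += (reward - q[a]) / pulls[a]   # update Q from M's feedback
        counts[a][o][s] += 1                 # update M on data gathered by Q's choice
    return q

q_values = train_toy_dan()
print(q_values)
```

The Q learner ends up preferring the informative sensor because that is the only choice under which the M learner can predict reliably, which mirrors the slide's point that each learner trains on the other's feedback.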
Experiments
Sensor selection
Baselines
• Coverage
• Random
• Coverage + DAN
• Shared representations
At each time step:
• Agent must select 1 out of 10 sensors to process observations from.
• Agent is rewarded for correctly predicting the x-y position of a person.
Experiments
Sensor selection
[Figure: Correct Predictions in Multi-person Tracking. x-axis: Num. Tracked People; y-axis: Correct Predictions per Episode (0 to 100). Legend: DAN + Coverage, Coverage, Random Policy, DAN shared, DAN.]
Experiments
Visual attention
[Figure: Test Curve in Terminal Reward Setting. x-axis: Training Episodes (0 to 20000); y-axis: Total reward in an episode (out of 1). Legend: Fashion-MNIST terminal-reward, MNIST terminal-reward, MNIST DAN, Fashion-MNIST DAN.]
[Figure: Test Curve in Continuous Reward Setting. x-axis: Training Episodes (0 to 20000); y-axis: Total reward in an episode (out of 12). Legend: Fashion-MNIST terminal-reward, MNIST terminal-reward, MNIST DAN, Fashion-MNIST DAN.]
Thank you!
Contact: [email protected]
yashsatsangi.com
Summary
• Connection between prediction rewards and information gain.
• Compute information gain estimates without explicit belief inference.