
Maximizing Information Gain via Prediction Reward

Yash Satsangi, Sungsu Lim, Shimon Whiteson, Frans Oliehoek, Martha White

Active perception

Sensor selection · Visual attention

The ability to take actions to reduce uncertainty

Active perception as an RL task

Reward: ρ(b) = −H(b)

Explicit belief inference?
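For concreteness, a minimal sketch of this reward for a discrete belief represented as a probability vector (the function name and the clipping constant are illustrative):

```python
import numpy as np

def neg_entropy_reward(belief: np.ndarray) -> float:
    """Reward rho(b) = -H(b) for a discrete belief b (non-negative, sums to 1)."""
    b = np.clip(belief, 1e-12, 1.0)        # avoid log(0) for zero-probability states
    return float(np.sum(b * np.log(b)))    # sum_s b(s) log b(s) = -H(b)

# A peaked (low-uncertainty) belief earns a higher reward than a uniform one.
print(neg_entropy_reward(np.array([0.97, 0.01, 0.01, 0.01])))  # about -0.17
print(neg_entropy_reward(np.full(4, 0.25)))                    # -log 4, about -1.39
```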

This paper

• Main theoretical result

• How can we design state-based reward functions that approximate information gain?

• Prediction rewards are a linear approximation to the negative entropy.

• Deep Anticipatory Networks (DAN)

• A deep RL algorithm that trains two deep neural networks simultaneously on each other's feedback.

• Useful when the reward is a convex function of the agent's belief.

• Experiments

• Sensor selection with DAN

• Visual attention with DAN

Prediction reward

A connection between prediction reward and information gain

Prediction reward: reward the agent for making an accurate prediction of the unknown variable.

Expected prediction reward: the agent receives r′ for a correct prediction and r′′ otherwise; the expectation is taken under the agent's belief b.
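A minimal sketch of this quantity, assuming the agent predicts the most likely value under its belief, so the prediction is correct with probability max_s b(s) (the concrete values of r′ and r′′ below are illustrative, not the slide's):

```python
import numpy as np

def expected_prediction_reward(belief: np.ndarray,
                               r_correct: float = 0.0,   # r' for a correct prediction
                               r_wrong: float = -1.0     # r'' otherwise
                               ) -> float:
    """Expected prediction reward when predicting the most likely value under b."""
    p_correct = float(np.max(belief))
    return r_correct * p_correct + r_wrong * (1.0 - p_correct)

print(expected_prediction_reward(np.array([0.7, 0.2, 0.1])))  # 0.0*0.7 + (-1.0)*0.3 = -0.3
```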

Main result

A connection between prediction reward and information gain: the expected prediction reward ρ′(b) approximates the negative-entropy reward ρ(b) = −H(b) up to a constant term, with the approximation error bounded by constants ε₁ and ε₂.
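A quick numeric check of this relationship under the same illustrative choices (r′ = 0, r′′ = −1): both quantities decrease together as the belief flattens, which is the sense in which the expected prediction reward tracks −H(b).

```python
import numpy as np

def neg_entropy(b):
    return float(np.sum(b * np.log(np.clip(b, 1e-12, 1.0))))

def pred_reward(b, r_correct=0.0, r_wrong=-1.0):
    return r_correct * np.max(b) + r_wrong * (1.0 - np.max(b))

for b in (np.array([0.9, 0.05, 0.05]),
          np.array([0.5, 0.3, 0.2]),
          np.full(3, 1.0 / 3.0)):
    print(f"rho'(b) = {pred_reward(b):+.3f}   -H(b) = {neg_entropy(b):+.3f}")
```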

Main consequences

A connection between prediction reward and information gain

• Can estimate using samples, without explicit belief inference (sketched below).

• Question answering · Visual attention · Intrinsic motivation

• This paper: active sensing · active perception · sensor selection
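A minimal sketch of the sample-based estimate (the 1/0 reward values and variable names are illustrative): instead of computing a belief and its entropy, average the prediction reward over sampled predictions and ground-truth values.

```python
import numpy as np

def sampled_prediction_reward(predictions, labels,
                              r_correct: float = 1.0, r_wrong: float = 0.0) -> float:
    """Monte Carlo estimate of the expected prediction reward from
    (prediction, ground truth) samples -- no explicit belief is needed."""
    predictions, labels = np.asarray(predictions), np.asarray(labels)
    return float(np.mean(np.where(predictions == labels, r_correct, r_wrong)))

# Four sampled episodes: the prediction matched the hidden variable in three.
print(sampled_prediction_reward([2, 0, 1, 2], [2, 1, 1, 2]))  # 0.75
```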

DAN: Deep Anticipatory Networks

Train the Q and M networks simultaneously: the Q agent is rewarded if the M agent predicts the unknown variable correctly.
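A high-level sketch of one DAN training episode under these rules. The environment interface (reset/step returning the true hidden variable for supervision), the epsilon-greedy exploration, the discount factor, and the absence of a replay buffer or target network are simplifying assumptions for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def run_dan_episode(env, q_net, m_net, q_opt, m_opt, epsilon=0.1, gamma=0.99):
    """One episode of Deep Anticipatory Network training: the Q network chooses
    sensing actions, the M network predicts the hidden variable, and the Q agent
    is rewarded whenever the M agent's prediction is correct."""
    obs, hidden = env.reset()          # observation tensor + true hidden variable (int)
    done = False
    while not done:
        # Q agent: epsilon-greedy choice of which sensor / glimpse to use next.
        with torch.no_grad():
            q_values = q_net(obs)
        if torch.rand(1).item() < epsilon:
            action = int(torch.randint(q_values.shape[-1], (1,)).item())
        else:
            action = int(q_values.argmax().item())

        next_obs, hidden, done = env.step(action)

        # M agent: supervised update towards the true hidden variable.
        logits = m_net(next_obs)
        m_loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([hidden]))
        m_opt.zero_grad(); m_loss.backward(); m_opt.step()

        # Prediction reward for the Q agent: +1 if M predicts correctly, else 0.
        reward = float(int(logits.argmax().item()) == hidden)

        # Q agent: one-step Q-learning update on that prediction reward.
        with torch.no_grad():
            bootstrap = 0.0 if done else gamma * q_net(next_obs).max().item()
        target = torch.tensor(reward + bootstrap)
        q_loss = F.mse_loss(q_net(obs)[action], target)
        q_opt.zero_grad(); q_loss.backward(); q_opt.step()

        obs = next_obs
```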

Experiments: Sensor selection

Baselines: Coverage · Random · Coverage + DAN · Shared representations

At each time step:
• The agent must select 1 out of 10 sensors to process observations from.
• The agent is rewarded for correctly predicting the (x, y) position of a person.
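A toy sketch of what the sensor-selection interaction could look like. The corridor layout, sensor coverage, and motion model are invented for illustration and the observations are plain arrays; this is not the paper's actual tracking setup.

```python
import numpy as np

class ToySensorSelectionEnv:
    """10 sensors on a 10-cell corridor; the chosen sensor reports whether the
    person is currently in its cell. The hidden variable is the person's cell."""
    def __init__(self, n_cells: int = 10, horizon: int = 20, seed: int = 0):
        self.n_cells, self.horizon = n_cells, horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.pos = int(self.rng.integers(self.n_cells))
        return np.zeros(2, dtype=np.float32), self.pos   # dummy initial observation

    def step(self, sensor: int):
        self.t += 1
        # The person takes a random step left, right, or stays in place.
        self.pos = int(np.clip(self.pos + self.rng.integers(-1, 2), 0, self.n_cells - 1))
        obs = np.array([sensor, float(self.pos == sensor)], dtype=np.float32)
        return obs, self.pos, self.t >= self.horizon
```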

Experiments: Sensor selection

[Figure: Correct predictions per episode in multi-person tracking vs. number of tracked people, comparing DAN + Coverage, Coverage, Random Policy, DAN shared, and DAN.]

Experiments: Visual attention

[Figure: Test curves in the terminal-reward setting. Total reward in an episode (out of 1) vs. training episodes (0 to 20000), for MNIST DAN, Fashion-MNIST DAN, MNIST terminal-reward, and Fashion-MNIST terminal-reward.]

[Figure: Test curves in the continuous-reward setting. Total reward in an episode (out of 12) vs. training episodes (0 to 20000), for MNIST DAN, Fashion-MNIST DAN, MNIST terminal-reward, and Fashion-MNIST terminal-reward.]

Thank you!

Contact: Y.Satsangi@tilburguniversity.edu

yashsatsangi.com

Summary

• Connection between prediction rewards and information gain.

• Compute information gain estimates without explicit belief inference.
