Reinforcement Learning in the Control of Attention
DESCRIPTION
Reinforcement Learning in the Control of Attention. Roderic A Grupen and Luiz M G Gonçalves. Laboratory for Analysis and Architecture of Systems (State University of Campinas in the near future), www.laas.fr/~lmgarcia. Laboratory for Perceptual Robotics, State University of Massachusetts (USA).

TRANSCRIPT
[Page 1]
Reinforcement Learning in the Control of Attention

Luiz M G Gonçalves
Laboratory for Analysis and Architecture of Systems (State University of Campinas in the near future)
www.laas.fr/~lmgarcia

Roderic A Grupen
Laboratory for Perceptual Robotics, State University of Massachusetts (USA)
www-robotics.cs.umass.edu
[Page 2]
Objective

To develop a robotic system that performs tasks involving attention and pattern categorization, integrating multi-modal (haptic and visual) information in a behaviorally cooperative active system.
[Page 3]
Motivation

Towards a useful robotic system able to:
- foveate (verge) the eyes onto a region of interest (ROI);
- keep attention on the ROI as long as needed;
- choose another ROI (shift the focus of attention).

The result is a behaviorally cooperative active system that provides on-line feedback to environmental stimuli in the form of actions.
[Page 4]
Method

- Use of (real-time) visual information from a stereo head and a simulator
- Selective attention (bottom-up salience maps)
- Multi-feature extraction (perceptual state)
- Associative memory (pattern address identification)
- Efficient topological mapping
- Learning policies to program the system
[Page 5]
Task Specification (Objectives)

- Visual monitoring or environment inspection
- Construction of an attentional map
- Keeping this map consistent with the current perception (update)
- Categorizing all patterns
[Page 6]
Markov Process

A stochastic process whose past does not influence the future once its present is completely specified.

Examples: checkers, chess.
$$P\{X(t_{n+1}) = x_{n+1} \mid X(t),\ t \le t_n\} = P\{X(t_{n+1}) = x_{n+1} \mid X(t_n)\}$$
[Page 7]
Dynamic Programming

Naive approach: traverse all possible states, testing every possibility (executing all actions indefinitely).

A better solution (DP): reduce the complexity of a problem that would be solved in dimension D to two or more problems in lower dimensions.

Example: stereo disparity: one 3D problem (x, y, d) is reduced to two 2D problems, (x, d) and (y, d).
[Page 8]
Pavlov

- The animal does the right thing, it gets food; the animal does the wrong thing, it gets punished.
- In theory, it is proven that only one of the two (reward or punishment) is enough: if it did the wrong thing, it simply gets no food.
- Thus: the robot does the right thing => reward.
[Page 9]
Reinforcement Learning (Related Work)

- Watkins: Learning from Delayed Rewards (1989).
- Sutton and Barto: Reinforcement Learning: An Introduction (1998).
- Araujo: Learning a Control Composition in a Complex Environment (1996).
- Huber: A Feedback Control Structure for On-line Learning Tasks (1997).
- Coelho: A Control Basis for Learning Multifingered Grasps (1997).
[Page 10]
Modelling a problem with delayed reinforcement as an MDP:

- a set of states S,
- a set of actions A,
- a reward function $R: S \times A \rightarrow \mathbb{R}$, and
- a state transition function $T: S \times A \rightarrow \Pi(S)$, which maps state transitions to probabilities.

(A toy instance of this tuple is sketched below.)
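As a concrete anchor for the notation above, here is a minimal sketch of the tuple (S, A, R, T) for a toy two-state, two-action problem; the states, actions, rewards and transition probabilities are invented for illustration and are not from the slides.

```python
# Toy MDP: S (states), A (actions), R: S x A -> reward, T: S x A -> Pi(S).
S = ["s0", "s1"]
A = ["stay", "move"]
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,     # reward for taking a in s
     ("s1", "stay"): 0.0, ("s1", "move"): 0.0}
T = {("s0", "stay"): {"s0": 1.0},                  # probability of each next state
     ("s0", "move"): {"s1": 0.9, "s0": 0.1},
     ("s1", "stay"): {"s1": 1.0},
     ("s1", "move"): {"s0": 0.9, "s1": 0.1}}
```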
Q-learning equation:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a)\right]$$
[Page 11]
Q-learning equation

- a = the action executed
- r = the reward received
- s' = the state resulting from applying a
- A = the set of all actions a' that can be executed in s'
- α = learning rate (typically 0.1)
- γ = discount factor (typically 0.5)

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a)\right]$$

(A minimal sketch of this single update follows below.)
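A minimal Python sketch of this single update, assuming the Q-table is stored as a dictionary keyed by (state, action) pairs; the helper name q_update and the toy call at the end are illustrative, not part of the original system.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.5):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

Q = defaultdict(float)                                   # unseen pairs start at 0
Q = q_update(Q, "s0", "move", r=1.0, s_next="s1", actions=["stay", "move"])
```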
[Page 12]
Observations

- A transition in the state space can be completely characterized by the tuple (s, a, r, s').
- Assuming that every pair (s, a) has Q(s, a) updated infinitely often, Q(s, a) converges with probability 1 to the best possible reward for that pair.
[Page 13]
Exploration and exploitation

- Exploration: choose an action at random.
- Exploitation: after some time the system starts to converge, so actions known to be contributing to convergence are chosen.
- Balance exploration against exploitation.
- Temperature (reminiscent of Simulated Annealing): choose randomly as a function of a temperature that starts high and is then lowered (a softmax sketch follows below).
- In practice, even at the end, about 10% of the choices remain random.
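A small sketch of temperature-based (Boltzmann/softmax) action selection consistent with these bullets; the residual 10% randomness comes from the last bullet, while the exact softmax form and parameter names are assumptions.

```python
import math
import random

def boltzmann_action(Q, s, actions, temperature, epsilon=0.1):
    """Mostly softmax over Q-values, with a residual epsilon chance of a
    purely random choice (the slide keeps roughly 10% random)."""
    if random.random() < epsilon:
        return random.choice(actions)
    prefs = [math.exp(Q[(s, a)] / temperature) for a in actions]  # high T -> near uniform
    threshold = random.random() * sum(prefs)
    acc = 0.0
    for a, p in zip(actions, prefs):
        acc += p
        if acc >= threshold:
            return a
    return actions[-1]
```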
[Page 14]
Q-learning algorithm

1) Define the current state s by decoding the sensory information available;
2) Use a stochastic action selector to determine action a;
3) Perform action a, generating a new state s' and a reinforcement r;
4) Calculate the temporal difference error r':

$$r' = r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a)$$

5) Update the Q-value of the state/action pair (s, a):

$$Q(s,a) \leftarrow Q(s,a) + \alpha\, r'$$

6) Go to 1.

(A code sketch of this loop follows below.)
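A sketch of the whole loop, reusing the boltzmann_action selector from the earlier sketch. decode_state() and execute() stand in for the robot/simulator interface of steps 1 and 3; they are hypothetical placeholders, not the original controllers.

```python
from collections import defaultdict

def q_learning(decode_state, execute, actions, steps=1000,
               alpha=0.1, gamma=0.5, temperature=10.0, cooling=0.99):
    Q = defaultdict(float)                                  # Q-table, all zeros at start
    for _ in range(steps):
        s = decode_state()                                  # 1) decode current state
        a = boltzmann_action(Q, s, actions, temperature)    # 2) stochastic action selector
        s_next, r = execute(a)                              # 3) act, observe s' and r
        td_error = (r + gamma * max(Q[(s_next, a2)] for a2 in actions)
                    - Q[(s, a)])                            # 4) temporal difference error r'
        Q[(s, a)] += alpha * td_error                       # 5) update Q(s, a)
        temperature = max(0.1, temperature * cooling)       # anneal exploration, 6) repeat
    return Q
```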
[Page 15]
Eligibility trace

Update not just one state-action pair at a time, but a whole sequence of pairs (after a series of actions has been executed). This speeds up convergence (a sketch follows below).
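A sketch of how such a trace could be kept, assuming the dictionary-based Q-table from the earlier sketches; the decay parameter lam and the accumulating-trace detail are assumptions rather than slide content. It would replace step 5 of the loop above, spreading each TD error over recently visited pairs.

```python
def trace_update(Q, trace, s, a, td_error, alpha=0.1, gamma=0.5, lam=0.9):
    trace[(s, a)] = trace.get((s, a), 0.0) + 1.0      # mark the pair just visited
    for pair, e in list(trace.items()):
        Q[pair] += alpha * td_error * e               # weight the update by eligibility
        trace[pair] = gamma * lam * e                 # let older pairs fade out
        if trace[pair] < 1e-4:
            del trace[pair]                           # prune negligible traces
    return Q, trace
```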
[Page 16]
In practice

- A table (the Q-table).
- Rows are the states (s).
- Columns are the actions (a).
- Each element Q(s, a) is a Q-value: the value given by the function that evaluates the utility of taking action a when the state is s.
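A dense-array version of the same table, as a sketch; the sizes and indices below are illustrative only.

```python
import numpy as np

n_states, n_actions = 16, 4                  # illustrative sizes
Q = np.zeros((n_states, n_actions))          # rows = states s, columns = actions a
Q[3, 1] += 0.1 * (1.0 + 0.5 * Q[5].max() - Q[3, 1])   # one update for s=3, a=1, r=1, s'=5
best_action_in_state_3 = int(np.argmax(Q[3]))          # exploitation: best column of row 3
```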
[Page 17]
Roger-the-Crab
[Page 18]
Stereo Head Environment
[Page 19]
Degrees of Freedom (Controllers)
[Page 20]
System Control Architecture
[Page 21]
Low-level Control: Defining a Target

- Pre-attentional phase (stimuli plus internal bias)
- Shifting attention (saccade generation)
- Fine saccade (using the target model)
- Verging the eyes onto a target (correlation)
- Movements are computed from the errors to the image centers (a proportional sketch follows below)
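A minimal sketch of that last bullet: a proportional correction computed from the pixel error between the attended target and the image center. The gain value and the pan/tilt convention are assumptions, not the original controller.

```python
def gaze_correction(target_px, center_px, gain=0.1):
    # Error in pixels between the attended target and the image center
    ex = target_px[0] - center_px[0]
    ey = target_px[1] - center_px[1]
    return gain * ex, gain * ey          # proportional (pan, tilt) increments
```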
[Page 22]
Low-level Control: Identifying Objects

- Selecting a region of interest
- Extracting features
- Associative memory match (a sketch follows below)
- Mapping objects and/or updating memory
- Pre-attentional maps
- Automatic supervised learning
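A sketch of the associative-memory match step under simple assumptions: stored patterns and the extracted features are vectors, and the match criterion is a cosine similarity above a threshold (both the similarity measure and the threshold are assumptions).

```python
import numpy as np

def memory_match(features, memory, threshold=0.8):
    best_id, best_sim = None, -1.0
    for obj_id, stored in memory.items():          # compare against every stored pattern
        sim = float(np.dot(features, stored) /
                    (np.linalg.norm(features) * np.linalg.norm(stored) + 1e-12))
        if sim > best_sim:
            best_id, best_sim = obj_id, sim
    if best_sim >= threshold:
        return best_id, best_sim                   # recognized object
    return None, best_sim                          # unknown: candidate for learning
```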
[Page 23]
Behavioral Program
[Page 24]
A straightforward control algorithm

- Step 0: Initialize the associative memory and start the concurrent controllers of the arms, neck, and eyes.
- Step 1: Re-direct attention; if a representation is activated, update the attentional maps and repeat this step.
- Step 2: Try a visual improvement; if a representation is activated, update the attentional maps and return to step 1.
- Step 3: Try an arm improvement; if a representation is activated, update the attentional maps and return to step 1.
- Step 4: Activate the "supervised learning" module, update the attentional maps and return to step 1.

(A code sketch of this priority loop follows below.)
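A sketch of these steps as a fixed-priority loop. The five callables stand in for the controllers and modules named above; each of the first three returns True when a representation is activated. Their names and signatures are assumptions, not the original interfaces.

```python
def inspection_loop(redirect_attention, visual_improvement, arm_improvement,
                    supervised_learning, update_maps, steps=100):
    for _ in range(steps):
        if redirect_attention():          # Step 1: shift attention
            update_maps()
            continue                      # representation activated: redo step 1
        if visual_improvement():          # Step 2: foveate / verge on the target
            update_maps()
            continue
        if arm_improvement():             # Step 3: haptic (arm) exploration
            update_maps()
            continue
        supervised_learning()             # Step 4: learn the unrecognized pattern
        update_maps()
```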
[Page 25]
Finite state machine
[Page 26]
Results
Q-learning convergence
[Page 27]
Partial Evaluation of strategies
Attentional shifts
[Page 28]
Partial Evaluation of strategies
Visual/arm improvements
[Page 29]
Partial Evaluation of strategies
Objects identified
[Page 30]
Partial Evaluation of strategies
New objects
[Page 31]
Global evaluation
Mapped objects
[Page 33]
Times for each phase or process

| Phase | Min (s) | Max (s) | Mean (s) |
|---|---|---|---|
| Computing retina | 0.145 | 0.189 | 0.166 |
| Transfer to host | 0.017 | 0.059 | 0.020 |
| Total acquiring | 0.162 | 0.255 | 0.186 |
| Pre-attention | 0.139 | 0.205 | 0.149 |
| Salience map | 0.067 | 0.134 | 0.075 |
| Total attention | 0.324 | 0.395 | 0.334 |
| Total saccade | 0.466 | 0.903 | 0.485 |
| Features for match | 0.135 | 0.158 | 0.150 |
| Memory match | 0.012 | 0.028 | 0.019 |
| Total matching | 0.323 | 0.353 | 0.333 |
[Page 34]
Conclusions

- The system can support other sensors.
- Attention and categorization act together: tasks must be formulated accordingly.
- The inspection task was successfully accomplished.
- The system currently supports a frame rate of 10-15 frames per second.
- The reinforcement learning approach worked well in simulation.
[Page 35]
Future work

- Consider focus for saccade generation and accommodation (vergence)
- Test with partially occluded objects
- Derive policies (with Q-learning) for the control of top-down attention
- Increase the state space and/or the set of actions
- Define other hierarchical tasks (several policies, each appropriate for a given task)
- Test the learning architecture in a real environment
[Page 36]
Thanks

Thanks to CNPQ, CAPES, FAPERJ, NSF and UMASS (USA).

To all of you for your patience.

To Mimmo and Dr. Arcangelo Distante for hosting me :-).