KERNEL TEMPORAL DIFFERENCES FOR REINFORCEMENT LEARNING WITH APPLICATIONS TO BRAIN MACHINE INTERFACES
By
JIHYE BAE
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2013
© 2013 Jihye Bae
I dedicate this to my family for their endless support.
ACKNOWLEDGMENTS
I would like to sincerely thank my Ph.D. advisor Prof. Jose C. Principe for his
invaluable guidance, understanding, and patience. It is hard to imagine that I could
complete this program without his continual support. It was my good fortune to meet
Prof. Principe. Thanks to him, I was able to obtain in-depth knowledge in adaptive signal
processing and information theoretic learning and to have an unforgettable lifetime
opportunity to enhance my view of research.
I would like to thank Prof. Justin C. Sanchez for enriching my knowledge in
neuroscience and supporting my research. His willingness and openness to collaborate
gave me chances to learn and to conduct practical experiments that have become
an important part of this dissertation. In addition, I want to thank Dr. Sanchez’s lab
members, especially Dr. Eric Pohlmeyer and Dr. Babak Mahmoudi for their help, advice,
and fruitful discussions. I would also like to thank my Ph.D. committee members, Prof.
John G. Harris, Prof. Paul D. Gader, and Prof. Arunava Banerjee for their valuable
comments and the critical feedback about my research.
I was very fortunate to have the opportunity to be part of the Computational
Neuro-Engineering Laboratory (CNEL) at the University of Florida. Thanks to the CNEL
members, I not only gained knowledge but also made memories that will remain for life.
I especially thank my lovely girls, Dr. Lin Li and Dr. Songlin Zhao, for their constant
support and wonderful friendship; I will never forget the first day at the University
of Florida and CNEL. You will always be with me in my heart. I also thank Austin
Brockmeier, Evan Kriminger, and Matthew Emigh for the valuable discussions, help,
and for introducing me to the diverse culture of the US. I thank my good old CNEL
friend, Stefan Craciun; I will always miss the energetic and exciting corner. I thank CNEL
alumni, Dr. Erion Hasanbelliu, Dr. Sohan Seth, and Dr. Alexander Singh Alvarado,
for good memories at Greenwich Green and CNEL. I also thank Dr. Divya Agrawal,
Veronica Bolon Canedo, Rakesh Chalasani, Goktug T. Cinar, Rosha Pokharel, Kwansun
Cho, Jongmin Lee, Pingping Zhu, Kan Li, In Jun Park, Miguel D. Teixeira, Bilal Fadlallah,
Gavin Philips, and Gabriel Nallathambi for their support and good friendship.
Last but not least, I want to thank my family for their encouragement and endless
support including my new family, Dr. Luis Gonzalo Sanchez Giraldo. You are my best
friend, classmate, lab mate, and lifetime partner. You brought me happiness and faith
and enriched my life both as a researcher and as a person.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 REINFORCEMENT LEARNING . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 STATE VALUE FUNCTION ESTIMATION/ POLICY EVALUATION . . . . . . . . 32
3.1 Temporal Difference(λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
    3.1.1 Temporal Difference(λ) in Reinforcement Learning . . . . . . . . . . 36
    3.1.2 Convergence of Temporal Difference(λ) . . . . . . . . . . . . . . . . 37
3.2 Kernel Temporal Difference(λ) . . . . . . . . . . . . . . . . . . . . . . . . . 41
    3.2.1 Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
    3.2.2 Kernel Temporal Difference(λ) . . . . . . . . . . . . . . . . . . . . . 42
    3.2.3 Convergence of Kernel Temporal Difference(λ) . . . . . . . . . . . . 44
3.3 Correntropy Temporal Differences . . . . . . . . . . . . . . . . . . . . . . . 46
    3.3.1 Correntropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
    3.3.2 Maximum Correntropy Criterion . . . . . . . . . . . . . . . . . . . . 48
    3.3.3 Correntropy Temporal Difference . . . . . . . . . . . . . . . . . . . . 49
    3.3.4 Correntropy Kernel Temporal Difference . . . . . . . . . . . . . . . . 50
4 SIMULATIONS - POLICY EVALUATION . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Linear Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Linear Case - Robustness Assessment . . . . . . . . . . . . . . . . . . . . 57
4.3 Nonlinear Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 Nonlinear Case - Robustness Assessment . . . . . . . . . . . . . . . . . . 69
5 POLICY IMPROVEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.1 State-Action-Reward-State-Action . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Q-learning via Kernel Temporal Differences and Correntropy Variants . . . 77
5.4 Reinforcement Learning Brain Machine Interface Based on Q-learning
    with Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6 SIMULATIONS - POLICY IMPROVEMENT . . . . . . . . . . . . . . . . . . . . 82
6.1 Mountain Car Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2 Two Dimensional Spatial Navigation Task . . . . . . . . . . . . . . . . . . . 88
7 PRACTICAL IMPLEMENTATIONS . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.1 Open Loop Reinforcement Learning Brain Machine Interface: Q-KTD(λ) . . 95
    7.1.1 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
    7.1.2 Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
    7.1.3 Center-out Reaching Task - Single Step . . . . . . . . . . . . . . . . 97
    7.1.4 Center-out Reaching Task - Multi-Step . . . . . . . . . . . . . . . . . 102
7.2 Open Loop Reinforcement Learning Brain Machine Interface: Q-CKTD . . . 104
7.3 Closed Loop Brain Machine Interface Reinforcement Learning . . . . . . . 107
    7.3.1 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
    7.3.2 Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
    7.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
    7.3.4 Closed Loop Performance Analysis . . . . . . . . . . . . . . . . . . . 113
8 CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . 118
APPENDIX
A MERCER’S THEOREM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
B QUANTIZATION METHOD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
LIST OF TABLES
Table page
6-1 The average success rate of Q-KTD and Q-CKTD. . . . . . . . . . . . . . . . . 93
LIST OF FIGURES
Figure page
1-1 The decoding structure of the reinforcement learning model in a brain machine interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2-1 The agent and environment interaction in reinforcement learning. . . . . . . . . 28
3-1 Diagram of adaptive value function estimation in reinforcement learning. . . . . 32
3-2 Contours of CIM(X , 0) in 2 dimensional sample space. . . . . . . . . . . . . . . 48
4-1 A 13 state Markov chain [6] for the linear case. . . . . . . . . . . . . . . . . . . 54
4-2 Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η0 in TD(λ). . . . . . . . . . . . . . . . . . . . . . . . . . 55
4-3 Performance over different kernel sizes in KTD(λ). . . . . . . . . . . . . . . . . 56
4-4 Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η0 in KTD(λ) with h = 0.2. . . . . . . . . . . . . . . . . . 56
4-5 Learning curve of TD(λ) and KTD(λ). . . . . . . . . . . . . . . . . . . . . . . . . 58
4-6 The comparison of state value V(x), x ∈ X, convergence between TD(λ) and KTD(λ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4-7 The performance of TD for different levels (variances σ2) of additive Gaussiannoise on the rewards. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4-8 The performance change of CTD over different correntropy kernel sizes, hc . . . 61
4-9 Learning curve of TD and CTD when Gaussian noise with variance σ² = 10 is added to the reward. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4-10 Performance of CTD corresponding to different correntropy kernel sizes hc, with a mixture of Gaussians noise distribution. . . . . . . . . . . . . . . . . . . 62
4-11 Learning curves of TD and CTD when the noise added to the rewards corresponds to a mixture of Gaussians. . . . . . . . . . . . . . . . . . . . . . . . . . 62
4-12 Performance changes of TD with respect to different Laplacian noise variances b². . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4-13 Performance of CTD depending on different correntropy kernel sizes hc withvarious Laplacian noise variances. . . . . . . . . . . . . . . . . . . . . . . . . . 64
4-14 Learning curve of TD and CTD when Laplacian noise with variance b² = 25 is added to the reward. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4-15 A 13 state Markov chain for the nonlinear case. . . . . . . . . . . . . . . . . . . 66
4-16 The effect of λ and the initial step size η0 in TD(λ). . . . . . . . . . . . . . . . . 66
4-17 The performance of KTD with different kernel sizes. . . . . . . . . . . . . . . . 67
4-18 Performance comparison over different combinations of λ and the initial step size η in KTD(λ) with h = 0.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4-19 Learning curves of TD(λ) and KTD(λ). . . . . . . . . . . . . . . . . . . . . . . . 68
4-20 The comparison of state value convergence between TD(λ) and KTD(λ). . . . 69
4-21 Performances of CKTD depending on the different correntropy kernel sizes. . . 70
4-22 Learning curve of KTD and CKTD. . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-23 The comparison of the state value function Ṽ estimated by KTD and Correntropy KTD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-24 Mean and standard deviation of RMS error over 100 runs at the 2000th trial. . . 73
4-25 Mean RMS error over 100 runs; note that the horizontal axis is on a log scale. 73
5-1 The structure of Q-learning via kernel temporal difference (λ) . . . . . . . . . . 79
5-2 The decoding structure of the reinforcement learning model in a brain machine interface using a Q-learning based function approximation algorithm. . . . . . 80
6-1 The Mountain-car task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6-2 Performance of Q-TD(λ) with various combinations of λ and η. . . . . . . . . 84
6-3 The performance of Q-KTD(λ) with respect to different kernel sizes. . . . . . . 85
6-4 Performance of Q-KTD(λ) with various combinations of λ and η. . . . . . . . 85
6-5 Relative frequency with respect to average number of iterations per trial of Q-TD(λ) and Q-KTD(λ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6-6 Average number of iterations per trial of Q-TD(λ) and Q-KTD(λ). . . . . . . . . 86
6-7 The performance of Q-CKTD with different correntropy kernel sizes. . . . . . . 88
6-8 Average number of steps per trial of Q-KTD and Q-CKTD. . . . . . . . . . . . . 88
6-9 The average success rates over 125 trials and 50 implementations. . . . . . . . 90
6-10 The average final filter sizes over 125 trials and 50 implementations. . . . . . . 91
6-11 Two dimensional state transitions of the first, third, and fifth sets with η = 0.9 and λ = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6-12 The average success rates over 125 trials and 50 implementations with respect to different filter sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6-13 The change of success rates (top) and final filter size (bottom) with ϵU = 5. . . 93
6-14 The change of average success rates by Q-KTD and Q-CKTD. . . . . . . . . . 94
7-1 The center-out reaching task for 8 targets. . . . . . . . . . . . . . . . . . . . . . 96
7-2 The comparison of average learning curves from 50 Monte Carlo runs between Q-KTD(0) and MLP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7-3 The average success rates over 20 epochs and 50 Monte Carlo runs with respect to different filter sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7-4 The comparison of KTD(0) with different final filter sizes and TDNN with 10 hidden units. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7-5 The effect of filter size control on 8-target single-step center-out reaching task. 101
7-6 The average success rates for various filter sizes. . . . . . . . . . . . . . . . . . 102
7-7 Reward distribution for right target. . . . . . . . . . . . . . . . . . . . . . . . . . 103
7-8 The learning curves for multi step multi target tasks. . . . . . . . . . . . . . . . 104
7-9 Average success rates over 50 runs. . . . . . . . . . . . . . . . . . . . . . . . . 106
7-10 Q-value changes per trial during 10 epochs. . . . . . . . . . . . . . . . . . . . 107
7-11 Target index and matching Q-values. . . . . . . . . . . . . . . . . . . . . . . . . 108
7-12 The success rates of each target over 1 through 5 epochs. . . . . . . . . . . . 109
7-13 Performance of Q-learning via KTD in the closed loop RLBMI . . . . . . . . . . 112
7-14 Proposed visualization method. . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7-15 The estimated Q-values and resulting policy for the projected neural states. . . 116
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
KERNEL TEMPORAL DIFFERENCES FOR REINFORCEMENT LEARNING WITH APPLICATIONS TO BRAIN MACHINE INTERFACES
By
Jihye Bae
August 2013
Chair: Jose C. Principe
Major: Electrical and Computer Engineering
Reinforcement learning brain machine interfaces (RLBMI) have been shown to be
a promising avenue for practical implementations of BMIs. In the RLBMI, a computer
agent and a user in the environment cooperate and learn co-adaptively. An essential
component in the agent is the neural decoder which translates the neural states of
the user into control actions for the external device in the environment. However, to
realize the advantages of the RLBMI in practice, there are several challenges that need
to be addressed. First, the neural decoder must be able to handle high dimensional
neural states containing spatial-temporal information. Second, the mapping from neural
states to actions must be flexible enough without making strong assumptions. Third,
the computational complexity of the decoder should be reasonable such that real time
implementations are feasible. Fourth, it should be robust in the presence of outliers or
perturbations in the environment. We introduce algorithms that take into account these
four issues.
To efficiently handle the high dimensional state spaces, we adopt the temporal
difference (TD) learning which allows the learning of the state value function using
function approximation. For a flexible decoder, we propose the use of kernel-based
representations, which provide nonlinear extensions of TD(λ) that we call kernel
temporal difference (KTD)(λ). Two key advantages of KTD(λ) are its nonlinear functional
approximation capabilities and convergence guarantees that gracefully emerge as
an extension of the convergence results known for linear TD learning algorithms. To
address the robustness issue, we introduce correntropy temporal difference (CTD) and
correntropy kernel temporal difference (CKTD), which are robust alternatives to the mean
square error (MSE) criterion employed by conventional TD learning.
From state value function estimation, all fundamental features of the proposed
algorithms can be observed. However, this is only an intermediate step in finding a
proper policy. Therefore, we extend all proposed TD algorithms to state-action value
function estimation based on Q-learning: Q-learning via correntropy temporal difference
(Q-CTD), Q-KTD(λ), and Q-CKTD. To illustrate the behavior of the proposed algorithms,
we apply them to the problem of finding an optimal policy on simulated sequential
decision making with continuous state spaces. The results show that Q-KTD and
Q-CKTD are able to find a proper control policy and give stable performance with the
appropriate parameters, and that Q-CKTD improves performance in off-policy learning.
Finally, the Q-KTD(λ) and Q-CKTD algorithms are applied to neural decoding in
RLBMIs. First, they are applied in open-loop experiments to find a proper mapping
between a monkey’s neural states and desired positions of a computer cursor or a
robotic arm. The experimental results show that the algorithms can effectively learn
the neural state-action mapping. Moreover, Q-CKTD shows that the optimal policy
can be estimated even without having perfect predictions of the value function with
a discrete set of actions. Q-KTD is also applied to closed-loop RLBMI experiments.
The co-adaptation of the decoder and the subject are observed. Results show that the
algorithm succeeds in finding a proper mapping between neural states and desired
actions. The kernel based representation combined with temporal differences is a
suitable approach to obtain a flexible neural state decoder that can be learned and
adapted online. These observations show the algorithms’ potential advantages in
relevant practical applications of RL.
CHAPTER 1
INTRODUCTION
Research in brain machine interfaces (BMIs) is a multidisciplinary effort involving
fields such as neurophysiology and engineering. Developments in this area have a wide
range of applications, especially for subjects with neuromuscular disabilities, for whom
BMIs may become a significant aid. Neural decoding of motor signals is one of the main
tasks that needs to be executed by the BMI.
Neural decoding is a process of extracting information from brain signals. For
example, we can reconstruct a stimulus based on the spike trains produced by certain
neurons in the brain. The main goal of neural decoding is to characterize the electrical
activity of groups of neurons, that is, identifying patterns of behavior that correlate with a
given task. This process is a fundamental step towards the design of prosthetic devices
that communicate directly with the brain.
Ideas from system theory can be used to frame the decoding problem. Bypassing
the body can be achieved by modelling the transfer function from brain activity to limb
movement and utilizing the output of the properly trained model to control a robotic
device to implement the intention of movement.
Some approaches to the design of neural decoding systems involve machine
learning methods. In order to choose the appropriate learning method, factors such as
learning speed and stability help in determining the usefulness of a particular method.
Supervised learning is commonly applied in BMI [19] because of the tremendous
body of work in system identification. Given a training set of neural signals and
synchronized movements, the problem is to find a mapping between the two which
can be solved by applying supervised learning techniques; the kinematic variables
of an external device are set as desired signals, and the system can be trained to
obtain the regression model. [22] showed that well-known supervised learning
algorithms such as the Wiener filter, the least mean square adaptive filter, and the time delay neural
network are able to estimate the mapping from spike trains from the motor cortex to
the kinematic variables of a monkey’s hand movements. [35] applied linear estimation
algorithms including ridge regression and a modified Kalman filter to estimate the cursor
position on a computer screen based on a monkey’s neural activity; the system was
also implemented for closed-loop brain control experiments. In addition, [49] used an
echo state network, which is a type of recurrent neural network, to decode a monkey's
neural activity in a center-out reach task in closed loop BMIs. Note that when closed
loop BMI experiments are conducted using supervised learning, a pre-trained functional
regression model is applied to estimate the desired kinematic values; after pre-training,
fixed model parameters are applied, and the system does not adapt simultaneously
during the experiments.
Even though the supervised learning approach has been applied to neural decoding
in real time control of BMIs, it is probably not the most appropriate methodology for the
problem because of the absence of ground truth in a paraplegic user who cannot move.
In addition, even if the desired signal were available, there are other factors such as brain
plasticity that still limit the functionality of supervised learning since frequent calibration
(retraining) becomes necessary. In BMIs, it is necessary to have direct communication
between the central nervous system and the computer that controls an external device
such as a prosthetic arm for disabled individuals. Thus, methods that can adapt and
adjust to subtle neural variations are preferred.
When we frame neural decoding as a sequential decision making problem, dynamic
programming (DP) is a classical approach to solve such problems. In sequential
decision making problems, there is a dynamic system whose evolution is affected by
the decisions being made. The goal is to find a decision making rule (feedback policy)
that optimizes a given performance criterion. However, DP has the following drawbacks:
it assumes that all model components, including the dynamics and the environment, are
known, and that all states are fully observable. In many practical applications, the
above conditions are often unsatisfied. In addition, to find an optimal decision maker, DP
requires the evaluation of all the states and controls. This results in high computational
demands when the problem dimension scales up (Bellman’s curse of dimensionality).
Furthermore, direct modelling is rather difficult since there are many factors that need
to be accounted for even within the same task and subject. Although the theoretical
foundation of reinforcement learning (RL) is drawn from dynamic programming (DP),
RL addresses the drawbacks of dynamic programming because it allows us to achieve
an approximation of the optimal value functions of DP without explicitly knowing the
environment.
On the other hand, RL is one of the representative learning schemes (along with supervised
and unsupervised learning) and provides a general framework for adapting a system
to a novel environment. RL differs from the other learning schemes in the sense that
RL not only observes but also interacts with the environment to collect the information.
Also, RL receives reward information from the environment which is frequently delayed
by unspecified time amounts. Thus, RL is considered the most realistic class of learning
and is rich with many algorithms for on-line learning with low computational complexity.
Reinforcement learning (RL) algorithms are a general framework for system adaptation
to novel environments; this characteristic is similar to the way biological organisms
interact with environment and learn from experience. In RL, it is possible to learn
only with information from the environment, and thus the need for a desired signal
is suppressed. Therefore, RL is well suited for the neural decoding stage of a BMI
application.
A BMI architecture based on reinforcement learning (RLBMI) is introduced in [13],
and successful applications of this approach can be found in [1, 30, 37]. In the RLBMI
architecture, there are two intelligent systems: the BMI decoder in the agent, and the
user in the environment. The two intelligent systems learn co-adaptively based on
closed loop feedback (Figure 1-1). The agent updates the state of the environment,
namely, the position of a cursor on a screen or a robot’s arm position, based on the
user’s neural activity and the received rewards. At the same time, the subject produces
the corresponding brain activity. Through iterations, both systems learn how to earn
rewards based on their joint behavior. The BMI decoder learns a control strategy based
on the user’s neural state and performs actions in goal directed tests that update the
state of the external device in the environment. In addition, the user learns the task
based on the state of the external device. Notice that both systems act symbiotically
by sharing the external device to complete their tasks, and this co-adaptation allows
for continuous synergistic adaptation between the BMI decoder and the user even in
changing environments.
Figure 1-1. The decoding structure of the reinforcement learning model in a brain machine interface.
Note that in the agent, the proper neural decoding of the motor signals is essential
to control the external device that interacts with the physical environment. However,
there are several challenges that must be addressed in practical implementations of
RLBMI:
1. High dimensional input state spaces
Algorithms must be able to readily handle high dimensional state spaces that correspond to the neural state representations.
2. Nonlinear mappings
The mapping from neural states to actions must be flexible enough to handle nonlinear mappings while making few assumptions.
3. Computational complexity
Algorithms should execute with a reasonable amount of time and resources that allow them to perform control actions in real time.
4. Robustness
The algorithms should handle cases where assumptions may not hold, e.g., the presence of outliers or perturbations in the environment.
In this dissertation, we introduce algorithms that take into account the aforementioned
issues.
RL learns optimal control policies (a map from states to actions) by observing
the interaction of a learning agent with the environment. At each step, the decision
maker decides an action given a state from a system (environment) to generate
desirable states. Over time, the controller (agent) learns by interacting with the system
(environment) while maximizing a quantity known as total reward. The aim of learning is
to derive the optimal control policies to bring the desired behavior into the system, and
the optimality is assessed in terms of the expected total reward known as value function.
Therefore, estimating the value function is a fundamental and crucial algorithmic
component in reinforcement learning problems.
Temporal difference (TD) learning is a method that can be applied to approximate
value functions through incremental computation directly from new experience without
having an associated model of environment. This allows us to efficiently handle high
dimensional states and actions by using adaptive functional approximators, which can
be trained directly from the data. TD algorithms approximate the value function based
on the difference between two estimations corresponding to subsequent inputs in time
(temporal difference error).
The introduction of the TD(λ) algorithm in [50] revived interest in TD learning
in the RL community. Here, λ represents an eligibility trace rate, which is added to the
averaging process over temporal differences to put emphasis on the most recently
observed states and to deal efficiently with delayed rewards. TD(λ) [50] is the
fundamental algorithm used to estimate state value functions, which can be utilized to
compute an approximate solution to Bellman’s equation using parametrized functions.
Because TD learning allows system updates directly from the sequence of states, online
learning becomes possible without having a desired signal at all times. For a majority
of real world prediction problems, TD learning has lower memory and computational
demands than supervised learning [50].
Since TD(λ) updates the value function whenever any state transition is
observed, it may make inefficient use of data. In addition, the manual selection
of optimal parameters (stepsize and the eligibility trace rate) is still required. A poor
choice of the stepsize and the eligibility trace parameters can cause a dramatically slow
convergence rate or an unstable system. TD(λ) is also sensitive to the distance between
optimal and initial parameters. However, it is popularly applied because of its simplicity
and ability to be used for online learning in multistep prediction problems.
To avoid the possibility of poor performance due to improper choice of the stepsize
and the initialization of parameters in TD(λ), the least squares TD (LSTD) and recursive
least squares TD (RLSTD) were introduced in [8]. Subsequently, an extension to
arbitrary values of λ, LSTD(λ), was proposed in [5]. However, in comparison to TD
(O(d)), LSTD and RLSTD have increased computational complexity per update: O(d³)
and O(d²) respectively, where d is the dimensionality of the state representation space.
The necessity of addressing computational efficiency has stimulated further interest
in online learning. Incremental least squares TD learning called iLSTD, which achieves
per-time-step complexities of O(d), was introduced in [15], and its theoretical analysis
extended to iLSTD(λ) can be found in [16]. This iLSTD uses a similar approach to
RLSTD, but to update the system and keep a low computational load, it only modifies
the single weight dimension that corresponds to the largest TD update. However,
theoretical analysis shows that convergence cannot be guaranteed under this greedy
approach, and modifications that guarantee convergence increase the computational
cost dramatically. This makes the above algorithm unattractive for online learning.
Even though all the above methods provide their own advantages such as
convergence, stability, or learning rate, they are limited to parametrized linear function
approximation which may not be as flexible especially in practical applications where
little prior knowledge can be incorporated. The importance of finding a proper functional
space turns our interest towards nonlinear models which are generally more flexible.
Nonlinear variants of TD algorithms have also been proposed. However, they are mostly
based on time delay neural networks, sigmoidal multilayer perceptrons, or radial basis
function networks. Despite their good approximation capabilities, these algorithms are
usually prone to falling into local minima [3, 7, 20, 54], turning training into an art.
There has been a growing interest in a class of learning algorithms that have
nonlinear approximation capabilities and yet allow cost functions that are convex. They
are known as kernel based learning algorithms [44]. One of the major appeals of kernel
methods is the ability to handle nonlinear operations on the data by indirectly using an
underlying nonlinear mapping to a so-called feature space (Reproducing Kernel Hilbert
Space (RKHS)) which is endowed with an inner product. A linear operation in the RKHS
corresponds to a nonlinear operation in the input space; for some kernel functions these
properties can lead to universal approximation of functions on the input space. Many
of the related optimization problems can be posed as convex (no local minima) with
algorithms that are still reasonably easy to compute (using the kernel trick [44]).
Recent work in adaptive filtering has shown the usefulness of kernel methods in
solving nonlinear adaptive filtering problems [20, 28]. Successful applications of the
kernel-based approach in supervised learning are well known through support vector
machines (SVM) [4], kernel least squares (KLS) [43], and Gaussian processes (GP)
[40]. Kernel-based learning has also been successfully integrated into reinforcement
learning [11, 12, 14, 17, 39, 57] demonstrating their potential advantages in this context.
Furthermore, kernel methods have been integrated with temporal difference
algorithms showing superior performance in nonlinear approximation problems. The
close relation between Gaussian processes and kernel recursive least squares was
exploited in [14] to bring the Bayesian framework into TD learning. Gaussian process
temporal difference uses kernels in probabilistic discriminative models based on
Gaussian processes, incorporating parameters such as variance of the observation
noise and providing predictive distributions (posterior variance) to evaluate predictions.
Similar work using kernel-based least squares temporal difference learning with
eligibilities called KLSTD(λ) was introduced in [58]. Unlike GPTD, KLSTD(λ) does
not use a probabilistic approach. The idea in KLSTD is to extend LSTD(λ) [5] using
the concept of duality. However, KLSTD(λ) uses a batch update, so its computational
complexity per time update is O(n³), which is not practical for online learning.
Here, we will investigate TD learning integrated with kernel methods, which we call
kernel temporal difference (KTD) (λ) [1, 2]. We adopt a learning approach based on
stochastic gradient methods, which is very popular in adaptive filtering. When combined
with kernel methods, the stochastic gradient can reduce the computational complexity
to O(n). Namely, we show how KTD(λ) can be derived from the kernel least mean
square (KLMS) [27] algorithm. Although the standard setting in supervised learning
differs from RL, since RL does not use explicit information from a desired signal at every
sample, elements such as the adaptive gain and the approximation error terms can be
well exploited in solving RL problems [50]. KTD shares many features with the KLMS
algorithm [27] except that the error is now obtained using the temporal differences,
i.e. the difference of consecutive outputs is used as the error guiding the adaptation
process.
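To make the connection concrete, the following is a rough Python sketch of the idea in its simplest form: a KTD(0)-style value estimator built from Gaussian-kernel, KLMS-type updates, where the TD error plays the role of the supervised error. The kernel width, step size, and the unbounded growth of the kernel expansion are illustrative simplifications, not the exact KTD(λ) algorithm developed in Chapter 3.

import numpy as np

class KTDZeroSketch:
    """Kernel value-function estimator updated with the TD error (KTD(0) flavor)."""

    def __init__(self, kernel_width=1.0, eta=0.1, gamma=0.9):
        self.h, self.eta, self.gamma = kernel_width, eta, gamma
        self.centers, self.coeffs = [], []   # growing kernel expansion

    def _kernel(self, x, c):
        # Gaussian (universal) kernel between a state x and a stored center c
        return np.exp(-np.sum((x - c) ** 2) / (2.0 * self.h ** 2))

    def value(self, x):
        return sum(a * self._kernel(x, c) for a, c in zip(self.coeffs, self.centers))

    def update(self, x, reward, x_next):
        # As in KLMS, each sample adds a kernel unit; the TD error replaces
        # the supervised error as the coefficient of the new unit.
        td_error = reward + self.gamma * self.value(x_next) - self.value(x)
        self.centers.append(np.asarray(x, dtype=float))
        self.coeffs.append(self.eta * td_error)
        return td_error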
Online KTD(λ) is well suited for nonlinear function approximation. It avoids some
of the main issues such as local minima or proper initialization that are common
in other nonlinear function approximation methods. In addition, based on the dual
representation, we can show other implicit advantages of using kernels. For instance,
universal kernels automatically satisfy one of the conditions for convergence of TD(λ).
Namely, linearly independent representations of states are obtained through the implicit
mapping associated with the kernel.
Even though this non-parametric technique requires a high computational cost that
comes with the inherently growing structure, when the problem is highly complicated
and requires a large amount of data, these techniques produce better solutions than
simple linear function approximation methods. In addition, as we will see in
this work, there are methods that we can employ to overcome scalability issues such as
growing filter sizes [9, 25].
In practice, it is common to face the situation where assumptions about noise or the
model deviate from standard considerations. For example, outliers, which result from
unexpected perturbations such as noisy state representations, transitions, or rewards,
can be difficult to be accounted for. In such cases, the controller may fail to obtain the
desired behavior. To the best of our knowledge, no study has addressed the issue of
how noise or small perturbations to the model affect performance in TD learning. Most
studies on TD algorithms focus on synthetic experiments such as simulated Markov
chains or random walk problems.
In our work, we investigate the maximum correntropy criterion (MCC) as an
objective function [38] that aims at coping with the above-mentioned difficulty. Correntropy
is a generalized correlation measure between two random variables first introduced in
[42]. It has been shown that correntropy is useful in non-Gaussian signal processing
[26] and effective for many applications under noisy environments [18, 21, 36, 45, 47].
Correntropy can be applied as a cost function, resulting in the maximum correntropy
criterion (MCC). A system can be adapted in such a way that the similarity between
desired and predicted signals is maximized. MCC serves as an alternative to MSE that
uses higher order information, which makes it applicable to cases where Gaussianity
and linearity assumptions do not necessarily hold.
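For illustration only, a stochastic-gradient MCC update for a simple linear filter is sketched below: the usual LMS correction is weighted by a Gaussian of the error, so large (outlier) errors contribute very little to the update. The step size and the kernel size hc are illustrative values, and constant factors are absorbed into the step size.

import numpy as np

def mcc_lms_update(w, x, d, eta=0.1, hc=1.0):
    """One maximum correntropy criterion (MCC) update of a linear filter w.

    Gradient ascent on the Gaussian correntropy between the desired and
    predicted outputs; equivalent to LMS with an error-dependent gain."""
    e = d - w @ x                              # prediction error
    gain = np.exp(-e ** 2 / (2.0 * hc ** 2))   # downweights outlier errors
    return w + eta * gain * e * x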
MCC has been applied to obtain robust methods for adaptive systems in supervised
learning [45, 46, 59]. In particular, an interesting blend between KLMS and maximum
correntropy criterion (MCC) was proposed in [59]. The basic idea of kernel maximum
correntropy (KMC) is that the input data is mapped into an RKHS using a nonlinear
mapping function, and the maximum correntropy criterion (MCC) is applied as the cost
function guiding adaptation. It was shown that KMC accurately approximates
nonlinear systems, and it was able to reduce the detrimental effects of various types of
noise in comparison to the conventional MSE criterion.
We will show how KMC can be incorporated into TD learning. Correntropy kernel
temporal difference (CKTD) can be derived in a similar way to TD learning when posed
as a supervised learning problem. As a result, we obtain a correntropy temporal
difference (CTD) algorithm, which extends TD and KTD algorithms to the robust
maximum correntropy criterion.
Note that the TD algorithms we have studied are introduced for state value
function estimation given a fixed policy. To solve complete RL problems, the algorithms
should allow the construction of near optimal policies. We want to find the optimal
state-action mapping (policy) by maximizing a cumulative reward, and this mapping
can be exclusively determined by the estimated state-action value function because it
quantifies the relative desirability of different state and action pairs.
Actor-Critic is one way to find an optimal policy based on the estimated action value
function. This is a well-known method that combines the advantages of policy gradient
and value function approximation. The Actor-Critic method contains two separate
systems (actor and critic), and each one of the systems is updated based on the other.
The actor controls the policy to select actions, and the critic estimates the value function.
Thus, after each action is selected from the given policy by the actor, the critic evaluates
the policy using the estimated value function. In [23], it is shown how TD algorithms can
be applied to the critic to estimate the value function, while the policy gradient method
is applied to update the actor. Based on the gradient of the value function obtained
from the critic, the policy in the actor is updated. The critic evaluates the value function
given a fixed policy from the actor. However, since the Actor-Critic method includes two
systems, it is challenging to adjust them simultaneously.
On the other hand, Q-learning [55] is a simple online learning method to find an
optimal policy based on the action value function Q. Despite being a simple approach,
Q-learning is commonly used because it is effective, and the agent can be updated
based solely on observations. The basic idea of Q-learning is that when the action value
Q is close to the optimal action value Q∗, the policy, which is greedy with respect to all
action values for a given state, is close to optimal.
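For reference, the core of the standard tabular Q-learning update [55] is sketched below; the environment interface (reset/step) and the ε-greedy exploration rule are illustrative assumptions made for the example.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               eta=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning; env is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done)."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration over the current action values
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[x].argmax())
            x_next, r, done = env.step(a)
            # off-policy TD target uses the greedy (max) action value
            target = r + gamma * (0.0 if done else Q[x_next].max())
            Q[x, a] += eta * (target - Q[x, a])
            x = x_next
    return Q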
Therefore, we can think of extending the proposed TD algorithms (KTD(λ), CTD,
and CKTD) to approximate Q-functions from which we can derive the optimal policy.
In particular, we will introduce Q-CTD, Q-KTD, and Q-CKTD algorithms. Q-learning is
a well known off-policy TD control algorithm; the form of state-action mapping function
(policy) is undetermined, and TD learning is applied to estimate the state-action value
function.
The convergence of Q-learning with function approximation has been a main
concern in its application [51]. [54] showed that Q-learning can diverge when combined
with nonlinear function approximation. In addition, [3] pointed out that value based RL
algorithms can become unstable when combined with function approximation.
Despite the above issues, [32] showed convergence properties of Q-learning with
linear function approximation under restricted conditions. Furthermore, the extension
of the gradient temporal difference (GTD) family of learning algorithms to Q-learning
called Greedy GQ [29] results in better convergence properties; the system converges
independently of the sampling distribution. However, Greedy GQ may get stuck in
local minima even with linear function approximation because the objective function is
non-convex.
Although [29, 32] showed the feasibility of applying Q-learning with linear function
approximation, the use of a nonlinear function approximator in Q-learning has not
yet been actively considered mainly because of the lack of convergence guarantees.
However, incorporating the kernel-based representation may bring the advantages of
nonlinear function approximation, and the convergence properties of linear function
approximation in Q-learning would still hold. A convergence result for Q-learning using
linear function approximation by temporal difference (TD)(λ) [50] is introduced in [32].
They proved that when the learning policy and the greedy policy are close enough,
the algorithm converges to a fixed point of a recursion based on the Bellman operator
they introduced in [32]. Their convergence result is based on a relation between the
autocorrelation matrices of the basis functions with respect to the learning policy and
the greedy policy. In addition, they assume a compact state space with a finite set of
bounded linearly independent basis functions. In Q-KTD, the representation space is
possibly infinite dimensional. Therefore, the direct extension of the results from [32]
would require an extended version of the ordinary differential equation (ODE) method to
a Hilbert space valued differential equation.
Since the policy is not fixed in Q-learning, the system is required to explore
the environment and learn under changing policies. The system should respond
accordingly and be able to disregard large changes that may result from exploration. To
address this problem we explore robustness through the maximum correntropy criterion
in the context of changing policies.
As we mentioned above, one of the practical objectives of our work is to apply the
proposed TD algorithms to neural decoding within the reinforcement learning brain
machine interface framework. In the RLBMI structure, the agent learns how to translate
the neural states into actions (direction) based on predefined reward values from the
environment. Since there are two intelligent systems, a BMI decoder in the agent and a
BMI user in the environment, interacting in closed loop feedback, we can understand the system as a
cooperative game. In fact, the BMI user has no direct access to actions, and the agent
must interpret the user’s brain activity correctly to facilitate the rewards [13].
Therefore, the proposed algorithms can be applied to the agent, which decodes
the neural states transforming them to the proper action directions that are in turn
executed by an external device such as a computer screen or a robotic arm. The
updated position of the actuator will influence the user’s subsequent neural states
because of the visual feedback involved in the process. That is how the two intelligent
systems learn co-adaptively and the closed loop feedback is created. In other words,
the input to the BMI decoder is the user’s neural states, which can be considered as the
user’s output. Likewise, the action directions of the external device are the decoder’s
output and because of the visual feedback they can also be considered as the input to
the user.
We will examine the capability of the Q-KTD algorithm both in open and closed
loop Reinforcement Learning Brain Machine Interfaces (RLBMI) to perform reaching
tasks. The closed loop RLBMI experiment will show how the two intelligent systems
co-adaptively learn in a real time reaching task. Note that Q-learning via KTD (Q-KTD)
is powerful in practical applications due to its nonlinear approximation capabilities.
Also, this algorithm is advantageous for real time applications since parameters can be
chosen on the fly based on the observed input states, and no normalization is required.
In addition, we will see the performance of Q-learning via correntropy KTD (Q-CKTD)
in open loop RLBMI experiments, and see how correntropy can improve performance
under changing policies.
The main contributions of this thesis are three new state value function
approximation algorithms based on temporal difference algorithms: kernel temporal
difference(λ), correntropy temporal difference, and correntropy kernel temporal
difference. The proposed algorithms are extended to find a control policy in reinforcement
learning problems based on Q-learning, and this leads to Q-learning via kernel
temporal difference (Q-KTD(λ)), Q-CTD, and Q-CKTD. Moreover, we provide a
theoretical analysis on the convergence and degree of sub-optimality of the proposed
algorithms based on the extension of existing results to the TD algorithm and its
Q-learning counterpart. Furthermore, we test the algorithms to illustrate their behavior
and overall performance both in state value function approximation and policy estimation
problems. Finally, we apply the proposed algorithms to RLBMI showing how the
developed methodology can be useful in relevant practical scenarios.
CHAPTER 2
REINFORCEMENT LEARNING
We will show the background of RL including the mathematical formulation of the
value function in Markov decision processes, and in the following chapters, we will see
how the temporal differences can be derived for value function estimation and applied to
RL algorithms.
In reinforcement learning, a controller (agent) interacts with a system (environment)
over time and modifies its behavior to improve performance. This performance is
assessed in terms of cumulative rewards, which are assigned based on a task goal. In
RL the agent tries to adjust its behavior by taking actions that will increase the reward
in the long run; these actions are directed towards the accomplishment of the task goal
(Figure 2-1).
Figure 2-1. The agent and environment interaction in reinforcement learning.
Assuming the environment is a stochastic (that is, if a certain state is visited
different times and the same action is taken, the following state may not be the same
each time) and stationary process that satisfies the Markov condition
P(x(n)|x(n − 1), x(n − 2), · · · , x(0)) = P(x(n)|x(n − 1)), (2–1)
it is possible to model the interaction between the learning agent and the environment
as a Markov decision process (MDP). For the sake of simplicity, we assume the states
and actions are discrete, but they can also be continuous. A Markov decision process
(MDP) consists of the following elements:
• x(n) ∈ X : states
• a(n) ∈ A : actions
• R^a_{xx'} : (X × A) × X → R : reward function over states x' ∈ X given a state-action pair (x, a) ∈ X × A,

  R^a_{xx'} = E[r(n+1) | x(n) = x, a(n) = a, x(n+1) = x'].  (2–2)

• P^a_{xx'} : state transition probability that gives a probability distribution over states X given a state-action pair in X × A,

  P^a_{xx'} = P(x(n+1) = x' | x(n) = x, a(n) = a).  (2–3)
At time step n, the agent receives the representation of the environment’s state
x(n) ∈ X as input, and according to this input the agent selects an action a(n) ∈ A. By
performing the selected action a(n), the agent receives a reward r(n + 1) ∈ R, and the
state of the environment changes from x(n) to x(n + 1). The new state x(n + 1) follows
the state transition probability Paxx ′ given the action a(n) and the current state x(n). At
the new state x(n + 1), the process repeats; the agent takes an action a(n + 1), and
this will result in a reward r(n + 2) and a state transition from x(n + 1) to x(n + 2). This
process continues either indefinitely or until a terminal state is reached depending on the
process.
There are two important concepts associated with the agent: the policy and value
functions.
• Policy π : X → A is a function that maps a state x(n) to an action a(n).
• The value function is a measure of long-term performance of an agent following a policy π starting from a state x(n),

  State value function:  V^π(x(n)) = E_π[R(n) | x(n)]  (2–4)
  Action value function: Q^π(x(n), a(n)) = E_π[R(n) | x(n), a(n)]  (2–5)
where R(n) is a return.
A common choice for the return is the infinite-horizon discounted model
R(n) = ∑_{k=0}^{∞} γ^k r(n+k+1),   0 < γ < 1  (2–6)
that takes into account the rewards in the long run, but weights them with a discount
factor that prevents the function from growing unbounded as k → ∞ and also provides
mathematical tractability [52].
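As a small illustration, the snippet below computes a truncated version of the discounted return in (2–6) for an arbitrary, made-up reward sequence and discount factor; in practice the sum runs to the end of an episode or is handled incrementally.

gamma = 0.9                                   # illustrative discount factor
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]           # r(n+1), r(n+2), ... (made-up values)

# Truncated discounted return R(n) = sum_k gamma^k * r(n+k+1)
R_n = sum(gamma ** k * r for k, r in enumerate(rewards))
print(R_n)   # 0.9^2 * 1 + 0.9^4 * 2 = 2.1222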
The objective of RL is to find a good policy that maximizes the expected reward of
all future actions given the current knowledge. Since the value function represents the
expected cumulative reward given a policy, the optimal policy π∗ can be obtained based
on the value function; a policy π is better than another policy π′ when the policy π gives
greater expected return than the policy π'. In other words, π ≥ π' when V^π(x) ≥ V^{π'}(x) or Q^π(x, a) ≥ Q^{π'}(x, a) for all x ∈ X and a ∈ A. Therefore, the optimal state value function V^{π*} is defined by V^{π*}(x(n)) = max_π V^π(x(n)), and the optimal action value function Q^{π*} can be obtained by Q^{π*}(x(n), a(n)) = max_π Q^π(x(n), a(n)).
When we have complete knowledge of R^a_{xx'} and P^a_{xx'}, an optimal policy π* can be
directly computed using the definition of the value function. For a given policy π, the
state value function V^π can be expressed as

V^π(x(n)) = E_π[R(n) | x(n)]  (2–7)
          = E_π[ ∑_{k=0}^{∞} γ^k r(n+k+1) | x(n) ]  (2–8)
          = E_π[ r(n+1) + γ ∑_{k=0}^{∞} γ^k r(n+k+2) | x(n) ]  (2–9)
          = ∑_a π(x(n), a(n)) ∑_{x'} P^a_{xx'} [ R^a_{xx'} + γ E_π[ ∑_{k=0}^{∞} γ^k r(n+k+2) | x' ] ]  (2–10)
          = ∑_a π(x(n), a(n)) ∑_{x'} P^a_{xx'} [ R^a_{xx'} + γ V^π(x') ]  (2–11)
The optimal policy π∗ is obtained by selecting an action a(n) satisfying
V^{π*}(x) = max_a ∑_{x'} P^a_{xx'} [ R^a_{xx'} + γ V^{π*}(x') ].  (2–12)
Equation (2–12) is commonly known as the Bellman optimality equation for V^{π*}. For the action value function Q^{π*}, the optimality equation can be obtained in a similar fashion

Q^{π*}(x, a) = ∑_{x'} P^a_{xx'} [ R^a_{xx'} + γ max_{a'} Q^{π*}(x', a') ].  (2–13)
The solution to the equations (2–12) and (2–13) can be obtained using dynamic
programming (DP) methods. However, this procedure is infeasible when the number
of variables increases due to the exponential growth of the state space (curse of
dimensionality) [52]. RL allows us to find policies which approach the Bellman optimal
policies without explicit knowledge of the environment (P^a_{xx'} and R^a_{xx'}); as we will see in the following chapter, in reinforcement learning, temporal difference (TD) algorithms approximate the value functions by learning the parameters using simulations rather than using the explicit state transition probability P^a_{xx'} and reward function R^a_{xx'}. The
estimated value functions will allow comparisons between policies and thus guide the
optimal policy search.
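For reference, the short sketch below solves equation (2–12) by value iteration on a made-up, fully known toy MDP; the transition tensor P, reward tensor R, and discount factor are arbitrary illustrative values. This is exactly the model knowledge that RL methods avoid assuming.

import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)

# P[a, x, x'] = P(x' | x, a); rows normalized to valid distributions
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
# R[a, x, x'] = expected reward for the transition (x, a) -> x'
R = rng.random((n_actions, n_states, n_states))

V = np.zeros(n_states)
for _ in range(1000):
    # Q[a, x] = sum_x' P^a_{xx'} (R^a_{xx'} + gamma * V(x'))
    Q = (P * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=0)             # Bellman optimality backup, eq. (2-12)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

greedy_policy = Q.argmax(axis=0)      # optimal action for each state
print(V, greedy_policy)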
In this chapter, we reviewed the learning paradigm and basic components of
reinforcement learning. The interaction between agent and environment is an important
feature in RL, and policy and value functions are key concepts in the agent providing the
control; based on the value function, a proper policy can be obtained.
CHAPTER 3
STATE VALUE FUNCTION ESTIMATION / POLICY EVALUATION
Value function estimation is an important sub-problem in finding an optimal policy in
reinforcement learning. In this chapter, we will introduce three new temporal difference
algorithms: kernel temporal difference (KTD)(λ), correntropy temporal difference (CTD),
and correntropy kernel temporal difference (CKTD). The algorithms extend the conventional
temporal difference (TD) algorithm, TD(λ), which is a representative online learning
algorithm for estimating the value function.
All of the algorithms listed above use temporal difference (TD) error to update the
system, and given a fixed policy π, the optimal value function can be estimated based
on the TD error. Figure 3-1 shows how the value function can be estimated using an
adaptive system based on the TD error. In an adaptive system, there are two important
elements: the learning algorithm concerned with the class of functions that can be
approximated by the system and the cost function which quantifies the fitness of the
function approximations.
Figure 3-1. Diagram of adaptive value function estimation in reinforcement learning. Given a fixed policy, the value function can be estimated based on the temporal difference error.
We propose using a kernel framework for the mapper. The implicit linear mapping
in a kernel space can provide universal approximation in the input space, and many
of the related optimization problems can be posed as convex (no local minima) with
algorithms that are still reasonably easy to compute (using the kernel trick [44]). In
addition, we apply correntropy as a cost function to find the optimal solution. Correntropy
is a robust similarity measure between two random variables or signals when heavy
tailed or non-Gaussian distributions are involved [21, 36, 45].
3.1 Temporal Difference(λ)
Temporal difference learning is an incremental learning method specialized for
prediction problems, and it provides an efficient learning procedure that can be applied
to reinforcement learning. In particular, TD learning allows learning directly from new
experience without having a model of the environment. It employs previous estimations
to provide updates to the current predictor.
In [50], the TD(λ) algorithm is derived as the solution to a multi-step prediction
problem. For a multi-step prediction problem, we have a sequence of input-output pairs
(x(1), d(1)), (x(2), d(2)), · · · , (x(m), d(m)), in which the desired output d can only be
observed at time m + 1. Then, a system will produce a sequence of predictions y(1),
y(2), · · · , y(m) based solely on the observed input sequences. In general, the predicted
output is a function of all previous inputs,
y(n) = f (x(1), x(2), · · · , x(n)); (3–1)
here, we assume that y(n) = f (x(n)) for simplicity. The predictor f can be defined
based on a set of parameters w , that is,
y(n) = f (x(n),w). (3–2)
Writing the multi-step prediction problem as a supervised learning problem, the
input-output pairs become (x(1), d), (x(2), d), · · · , (x(m), d), and the update rule at each
time step can be written as,
Δw_n = η (d − y(n)) ∇_w y(n),  (3–3)
where η is the learning rate, and the gradient vector ∇_w y(n) contains the partial
derivatives of y(n) with respect to w . As we mentioned above, the desired value of the
prediction d only becomes available at time m + 1, and thus the parameter vector w can
only be updated after m time steps. The update is given by the following expression
w ← w + ∑_{n=1}^{m} Δw_n.  (3–4)
When the predicted output y(n) is a linear function of x(n), we can write the predictor as
y(n) = w^ᵀ x(n), for which ∇_w y(n) = x(n), and the update rule becomes

Δw_n = η (d − w^ᵀ x(n)) x(n).  (3–5)
The key observation to extend the supervised learning approach to the TD method is
that the difference between desired and predicted output at time n can be written as
d − y(n) = ∑_{k=n}^{m} (y(k+1) − y(k))  (3–6)
where y(m+1) ≜ d. Using this expansion in terms of the differences between sequential
predictions, we can update the system at each time step. The TD update rule is derived
as follows:
w ← w + ∑_{n=1}^{m} Δw_n  (3–7)
  = w + η ∑_{n=1}^{m} (d − y(n)) ∇_w y(n)  (3–8)
  = w + η ∑_{n=1}^{m} ∑_{k=n}^{m} (y(k+1) − y(k)) ∇_w y(n)  (3–9)
  = w + η ∑_{k=1}^{m} ∑_{n=1}^{k} (y(k+1) − y(k)) ∇_w y(n)  (3–10)
  = w + η ∑_{n=1}^{m} (y(n+1) − y(n)) ∑_{k=1}^{n} ∇_w y(k)  (3–11)
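As a quick numerical sanity check of this derivation, the snippet below verifies that the incremental TD form (3–11) accumulates to exactly the same total update as the supervised rule (3–3)–(3–4); the linear predictor and the random data are made-up for the example.

import numpy as np

rng = np.random.default_rng(0)
m, dim, eta = 5, 3, 0.1
X = rng.normal(size=(m, dim))        # inputs x(1), ..., x(m)
d = rng.normal()                     # desired output, revealed at time m + 1
w = rng.normal(size=dim)             # current weights of the linear predictor

y = X @ w                            # predictions y(1), ..., y(m)
y_ext = np.append(y, d)              # define y(m+1) = d

# Supervised form (3-3)-(3-4): accumulate eta * (d - y(n)) * x(n)
dw_supervised = sum(eta * (d - y[n]) * X[n] for n in range(m))

# TD form (3-11): eta * sum_n (y(n+1) - y(n)) * sum_{k<=n} x(k)
cum_grad = np.cumsum(X, axis=0)      # running sums x(1) + ... + x(n)
dw_td = sum(eta * (y_ext[n + 1] - y_ext[n]) * cum_grad[n] for n in range(m))

print(np.allclose(dw_supervised, dw_td))   # True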
In this case, all predictions are used equally. By using exponential weighting on
recency, we can emphasize more recent predictions, and this yields the following update rule, in which the weighted sum of gradients is called the eligibility trace:
Δw_n = η (y(n+1) − y(n)) ∑_{k=1}^{n} λ^{n−k} ∇_w y(k).  (3–12)
The eligibility trace is a common method used in RL to deal with delayed reward; it
allows propagating the rewards backward over the current state without remembering
the trajectory explicitly. Expression (3–12) is known as the TD(λ) update rule [50], and
the difference between predictions of sequential inputs is called TD error
eTD(n) = y(n + 1)− y(n). (3–13)
Note that when λ = 0, the update rule becomes
w ← w + η ∑_{n=1}^{m} (y(n + 1) − y(n)) x(n), (3–14)
and this has the same form as the LMS update except that the error term is replaced by the incremental difference between sequential outputs.
In supervised learning, the predictor can only be updated once the error (difference
between predicted output and desired signal) is available. Therefore, in the multi-step
prediction problem, the system could not be updated until the error became available, which only happens at the future time m + 1. In contrast, the
TD algorithm allows system updates directly from the sequence of states. Therefore,
online learning becomes possible without having the desired signal available at all times.
This allows efficient learning in most real world prediction problems; TD learning has
lower memory and computational demands than supervised learning, and empirical
results show that TD(λ) can provide more accurate predictions [50].
3.1.1 Temporal Difference(λ) in Reinforcement Learning
Now, let us see how the TD(λ) algorithm is employed in RL. When we consider the
prediction y as state value function V π given a fixed policy π, TD(λ) can approximate the
state value function ~V π using a parametrized family of functions of the form
~V (x(n)) = w⊤x(n) (3–15)
with parameter vector w ∈ Rd . For convenience, we use V to denote V π unless we need
to indicate different policies. Note that the objective of TD(λ) is to minimize the mean
square error (MSE) criterion,
min E[(V(x(n)) − ~V(x(n)))²]. (3–16)
Based on (2–9), we can obtain an approximate form of the recursion involved in the
Bellman equation as follows:
~V (x(n)) ≈ r(n + 1) + γ ~V (x(n + 1)). (3–17)
Thus, the TD error at time n (3–13) can be associated with the following expression
eTD(n) = r(n + 1) + γV (x(n + 1))− V (x(n)), (3–18)
and the error term (3–18) combined with (3–12) gives us the following per time-step
update rule;
Δw_n = η (r(n + 1) + γV(x(n + 1)) − V(x(n))) ∑_{k=1}^{n} λ^{n−k} ∇_w V(x(k)). (3–19)
Algorithm 1 shows pseudo code for the implementation of the TD(λ) algorithm for
linear value function approximation. The algorithm assumes the following information to
be given:
• a fixed policy π in MDP
• a discount factor γ ∈ [0, 1]
• a parameter λ ∈ [0, 1]
• a sequence of stepsizes η1, η2, · · · for incremental coefficient updating
Algorithm 1 Pseudo code of the TD(λ) algorithm in reinforcement learning
Set w = 0 (or an arbitrary estimate)
Set n = 1
for n ≥ 1 do
    z(n) = x(n), where x(n) ∈ X is a start state
    while (x(n) ≠ terminal state) do
        Simulate one step of the process producing reward r(n + 1) and next state x(n + 1)
        Δw_n = (r(n + 1) + γwᵀx(n + 1) − wᵀx(n)) z(n)
        w ← w + η_n Δw_n
        z(n + 1) = λz(n) + x(n + 1)
        n = n + 1
    end while
end for
At each state transition, the algorithm computes the one step TD error r(n + 1) + γwᵀx(n + 1) − wᵀx(n), and depending on the eligibilities z(n) = ∑_{k=n0}^{n} λ^{n−k} x(k), a portion of the TD error is propagated back to update the system.
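To make Algorithm 1 concrete, the following Python sketch implements the per-step TD(λ) update with linear value function approximation. It is only an illustration: the environment interface step(x), assumed to return the reward, the next state vector, and a termination flag, is a hypothetical helper and not part of the original algorithm description.

import numpy as np

def td_lambda_episode(step, x_start, w, lam=1.0, gamma=1.0, eta=0.1):
    """Run one episode of TD(lambda) with a linear value function V(x) = w^T x."""
    x = x_start
    z = x_start.copy()                 # eligibility trace z(n), initialized with the start state
    while True:
        r, x_next, done = step(x)      # simulate one state transition
        v_next = 0.0 if done else w @ x_next
        td_error = r + gamma * v_next - w @ x
        w = w + eta * td_error * z     # propagate a portion of the TD error back
        if done:
            return w
        z = lam * z + x_next           # z(n + 1) = lambda * z(n) + x(n + 1)
        x = x_next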
3.1.2 Convergence of Temporal Difference(λ)
We will see that in the cases of λ = 0 and λ = 1, the TD solutions converge asymptotically to the ideal solution under given conditions in an absorbing Markov process. For all other values of λ, the solution also converges, but in general the answer is different from the one given by the least mean squares algorithm.
Remember that the conventional TD algorithm assumes that the function class is
linearly parametrized satisfying y = w⊤x . This assumption will be also considered in the
convergence proof for TD with any 0 ≤ λ ≤ 1.
• λ = 1 case
The TD(1) procedure is equivalent to (3–11), and this gives the same per-sequence
weight changes as the supervised learning method since (3–11) is derived by directly
replacing the error term in supervised learning using (3–6) (Theorem 3.1).
Theorem 3.1. On multistep prediction problems, the linear TD(1) procedure produces the same per-sequence weight changes as the Widrow-Hoff rule [50].
• λ = 0 case
The convergence result for linear TD(0) presented in [50] is proved under the assumption that the dynamic system providing the states corresponds to an absorbing Markov process. In an absorbing Markov process, there is a set of terminal states T, a set of
non-terminal states N , and transition probabilities pij where i ∈ N , j ∈ N ∪ T . The
transition probabilities are set such that a terminal state will be visited in a finite number
of state transitions. Here, we assume that an initial state is selected with probability µi
among non-terminal states.
Given the initial state x(i), an absorbing Markov process generates a state
sequence x(i), x(i + 1), · · · , x(j) where x(j) ∈ T . At the terminal state x(j), the
output d is selected from an arbitrary probability distribution with expected value d̄_j. On this absorbing Markov process, the learning procedure should converge asymptotically to the desired behavior based on experience. Here, the desired behavior is to map each non-terminal state x(i) to the expected outcome d given that the sequence starts from i; thus, the ideal predictions y(i) = f(x(i), w) should be equal to E[d|x(i)], ∀i ∈ N. For a completed sequence, we
have the following relation
E[d|x(i)] = ∑_{j∈T} p_{ij} d̄_j + ∑_{j∈N} p_{ij} ∑_{k∈T} p_{jk} d̄_k + ∑_{j∈N} p_{ij} ∑_{k∈N} p_{jk} ∑_{l∈T} p_{kl} d̄_l + · · · (3–20)
= [∑_{k=0}^{∞} Q^k h]_i (3–21)
= [(I − Q)^{−1} h]_i, (3–22)
where [Q]_{ij} = p_{ij} for i, j ∈ N, and [h]_i = ∑_{j∈T} p_{ij} d̄_j for i ∈ N.
The following theorem shows that TD(0) converges to the ideal predictions for the
appropriate step size when the states {x(i)|i ∈ N} are linearly independent.
Theorem 3.2. For any absorbing Markov chain, for any distribution of starting probabilities µ_i, for any outcome distributions with finite expected values d̄_j, and for any linearly independent set of observation vectors {x(i)|i ∈ N}, there exists an ϵ > 0 such that, for all positive η < ϵ and for any initial weight vector, the predictions of linear TD(0) converge in expected value to the ideal predictions. That is, if w_n denotes the weight vector after n sequences have been experienced, then lim_{n→∞} E[w_nᵀ x(i)] = E[d|x(i)] = [(I − Q)^{−1} h]_i, ∀i ∈ N [50].
• 0 < λ < 1 case
The work in [10] extended the convergence of TD to general λ. The TD update rule
(3–11) can be expressed as
w̄^s_{n+1} = w̄^s_n + η X D [Q^s Xᵀ w̄^s_n − Xᵀ w̄^s_n + (Q^{s−1} + Q^{s−2} + · · · + I) h], (3–23)
where w̄ denotes the expected weights, X is a state matrix defined as [X]_{ab} = [x_a]_b, where a runs over the states and b over the dimensions, D is a diagonal matrix satisfying [D]_{ab} = δ_{ab} d_a, where δ_{ab} is the Kronecker delta and d_a is the expected number of times the Markov chain is in state x_a in one sequence, and s denotes the number of state transitions being traced.
After multiplying both sides of (3–23) by Xᵀ and reorganizing the equation using (I + Q + Q² + · · · + Q^{s−1}) h = (I − Q^s) E[d|x(i)] and w̄^λ_n = (1 − λ) ∑_{s=1}^{∞} λ^{s−1} w̄^s_n, we can obtain the following equation
Xᵀ w̄^λ_{n+1} = Xᵀ w̄^λ_n − η Xᵀ X D [I − (1 − λ) Q (I − λQ)^{−1}] (Xᵀ w̄^λ_n − E[d|x(i)]). (3–24)
When the state representations are linearly independent, X has full rank, and the right-hand term of (3–24),
− Xᵀ X D [I − (1 − λ) Q (I − λQ)^{−1}] (Xᵀ w̄^λ_n − E[d|x(i)]), (3–25)
has a full set of nonzero eigenvalues whose real parts are negative. Therefore, if the
above conditions hold, it can be shown that TD(λ) converges with probability 1 using
Theorem 3.3 [24].
Theorem 3.3. Let {y(n)} be given by
y(n + 1) = y(n) + ηn(g(y(n)) + βn + ξn) (3–26)
satisfying the following assumptions
1. g is a continuous R^d-valued function on R^d.
2. {β_n} is a sequence of R^d-valued random variables, bounded with probability 1, such that β_n → 0 with probability 1.
3. {η_n} is a sequence of positive real numbers such that η_n → 0 and ∑_n η_n = ∞.
4. {ξ_n} is a sequence of R^d-valued random variables such that for some T > 0 and each ϵ > 0
lim_{n→∞} P( sup_{j≥n} max_{t≤T} | ∑_{i=m(jT)}^{m(jT+t)−1} η_i ξ_i | ≥ ϵ ) = 0,
where m(t) is defined by max{n : t_n ≤ t} for t ≥ 0 and t_n = ∑_{i=0}^{n−1} η_i.
Also, let {y(n)} be bounded with probability 1. Then, there is a null set N such that ω ∉ N implies that {y^n(·)} is equicontinuous, and also that the limit y(·) of any convergent subsequence of {y^n(·)} is bounded and satisfies the ordinary differential equation (ODE)
y′ = g(y) (3–27)
on the time interval (−∞, ∞). Let y_0 be a locally asymptotically stable (in the sense of Liapunov) solution to (3–27) with domain of attraction DA(y_0). Then, if ω ∉ N and there is a compact set A ⊂ DA(y_0) such that y(n) ∈ A infinitely often, we have y(n) → y_0 as n → ∞ [24].
When we set y(n) in (3–26) as X⊤wn, (3–25) satisfies the ordinary differential
equation (3–27). Therefore, under the assumptions in Theorem 3.3, the differential
equation is asymptotically stable with equilibrium E[d|x(i)]; that is, w_n → w* as n → ∞ with probability 1, where Xᵀw* = E[d|x(i)] and X is full rank.
3.2 Kernel Temporal Difference(λ)
In the previous section, we introduced TD(λ), and we observed how the value
function can be estimated adaptively. Note that TD(λ) approximates the value function with a linear function, which may be limited in practice.
As an alternative, algorithms with nonlinear approximation capabilities have
become a topic of growing interest. Nonlinear variants of TD algorithms have also
been proposed, and they are mostly based on time delay neural networks, sigmoidal
multilayer perceptrons, or radial basis function networks. Despite their good approximation
capabilities, these algorithms are usually prone to fall into local minima [3, 7, 20, 54], so the optimality of the resulting TD(λ) solution is not guaranteed. Kernel methods have become
an appealing choice due to their elegant way of dealing with nonlinear function
approximation problems; the kernel based algorithms have nonlinear approximation
capabilities, yet the cost function can be convex [44].
In the following, we will show how the conventional TD(λ) algorithm can be
extended using kernel functions to obtain nonlinear variants of the algorithm; we
introduce a kernel adaptive filter implemented with stochastic gradient on temporal
differences called kernel temporal difference (KTD)(λ).
3.2.1 Kernel Methods
The basic idea of kernel methods is to nonlinearly map the input data to a high
dimensional feature space of vectors. Let X be a nonempty set. For a positive definite
function κ : X × X → R [28, 44], there exists a Hilbert space H and a mapping
ϕ : X → H such that
κ(x , y) = ⟨ϕ(x),ϕ(y)⟩. (3–28)
The inner product in the high dimensional feature space can be calculated by evaluating
the kernel function in the input space. Here, H is called a reproducing kernel Hilbert
space (RKHS) because it satisfies the following property
f (x) = ⟨f ,ϕ(x)⟩ = ⟨f ,κ(x , ·)⟩,∀f ∈ H. (3–29)
The mapping implied by the use of the kernel function can also be understood
through Mercer’s Theorem (Appendix A) [33]. These properties allow us to transform
conventional linear algorithms in the feature space to non-linear systems without
explicitly computing the inner product in the high dimensional space.
3.2.2 Kernel Temporal Difference(λ)
In supervised learning, a stochastic gradient solution to least squares function
approximation using a kernel method called kernel least mean square (KLMS) is
introduced in [27]. The KLMS algorithm attempts to minimize the risk functional
E[(d − f(x))²] by minimizing the empirical risk J(f) = ∑_{n=1}^{N} (d(n) − f(x(n)))² on the space H induced by the kernel κ. Using (3–29), we can rewrite
J(f) = ∑_{n=1}^{N} [d(n) − ⟨f, ϕ(x(n))⟩]². (3–30)
By differentiating the empirical risk J(f ) with respect to f and approximating the sum by
the current difference (stochastic gradient), we can derive the following update rule
f_0 = 0
f_n = f_{n−1} + η e(n) ϕ(x(n)), (3–31)
where e(n) = d(n)− fn−1(x(n)), which corresponds to KLMS algorithm [27]. Given a new
state x(n), the output can be calculated using the kernel expansion,
f_{n−1}(x(n)) = f_{n−2}(x(n)) + η e(n − 1) κ(x(n − 1), x(n)) (3–32)
= η ∑_{k=1}^{n−1} e(k) κ(x(k), x(n)). (3–33)
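As a reference point for the kernel extension, the KLMS recursion (3–31)–(3–33) can be sketched in a few lines of Python; the Gaussian kernel choice and the parameter values here are assumptions made only for illustration.

import numpy as np

def gauss_kernel(a, b, h=0.5):
    return np.exp(-np.linalg.norm(np.asarray(a) - np.asarray(b)) ** 2 / (2 * h ** 2))

def klms(X, d, eta=0.5, h=0.5):
    """Kernel LMS: the function is stored as centers x(k) with coefficients eta * e(k)."""
    centers, coeffs = [], []
    for x_n, d_n in zip(X, d):
        # evaluate f_{n-1}(x(n)) through the kernel expansion (3-33)
        y_n = sum(c * gauss_kernel(x_k, x_n, h) for x_k, c in zip(centers, coeffs))
        e_n = d_n - y_n
        centers.append(x_n)
        coeffs.append(eta * e_n)
    return centers, coeffs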
As we mentioned above, for a multi-step prediction problem, we can simply say
y(n) = f (x(n)). Let the function f belong to an RKHS H as in KLMS. By treating the
observed input sequence and the desired prediction as a sequence of pairs (x(1), d),
(x(2), d), · · · , (x(m), d) and making d ≜ y(m + 1), we can obtain the updates of the function
f after the whole sequence of m inputs has been observed as
f ← f + ∑_{n=1}^{m} Δf_n (3–34)
= f + η ∑_{n=1}^{m} e(n) ϕ(x(n)) (3–35)
= f + η ∑_{n=1}^{m} [d − f(x(n))] ϕ(x(n)). (3–36)
Here, Δf_n = η[d − ⟨f, ϕ(x(n))⟩] ϕ(x(n)) are the instantaneous updates of the function
f from input data based on the kernel expansions (3–29). By replacing the error d −
f (x(n)) using the relation with temporal differences (3–6) and reorganizing the equation
(3–36) as in the TD(λ) derivation from [50], we can obtain the following update
f ← f + η ∑_{n=1}^{m} [f(x(n + 1)) − f(x(n))] ∑_{k=1}^{n} ϕ(x(k)), (3–37)
and generalizing for λ yields
f ← f + η ∑_{n=1}^{m} [f(x(n + 1)) − f(x(n))] ∑_{k=1}^{n} λ^{n−k} ϕ(x(k)). (3–38)
The temporal differences f (x(n + 1)) − f (x(n)) can be rewritten using the kernel
expansions as ⟨f ,ϕ(x(n + 1))⟩ − ⟨f ,ϕ(x(n))⟩. This yields
f ← f + η ∑_{n=1}^{m} ⟨f, ϕ(x(n + 1)) − ϕ(x(n))⟩ ∑_{k=1}^{n} λ^{n−k} ϕ(x(k)), (3–39)
where Δf_n = η ⟨f, ϕ(x(n + 1)) − ϕ(x(n))⟩ ∑_{k=1}^{n} λ^{n−k} ϕ(x(k)). This update rule (3–39)
is called kernel temporal difference (KTD)(λ) [1, 2]. Using the RKHS properties, the
evaluation of the function f at a certain x can be calculated as a kernel expansion.
When λ = 0, the update rule becomes
f ← f + η ∑_{n=1}^{m} ⟨f, ϕ(x(n + 1)) − ϕ(x(n))⟩ ϕ(x(n)), (3–40)
and it is noticeable that the update rule is exactly of the same form as KLMS (3–36)
except for the error terms; in supervised learning, the error is defined as the difference
between desired signal and predictions at time n, whereas in TD learning, the error is
the difference between sequential predictions.
In addition, equation (3–39) can be modified for state value function approximation
by replacing the error term using (3–18);
f ← f + η ∑_{n=1}^{m} [r(n + 1) + γV(x(n + 1)) − V(x(n))] ∑_{k=1}^{n} λ^{n−k} ϕ(x(k)) (3–41)
= f + η ∑_{n=1}^{m} [r(n + 1) + ⟨f, γϕ(x(n + 1)) − ϕ(x(n))⟩] ∑_{k=1}^{n} λ^{n−k} ϕ(x(k)). (3–42)
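A per-step Python sketch of the KTD(λ) value function update (3–42) follows; the function f is kept as a kernel expansion over the visited states, with an eligibility factor attached to each center. The Gaussian kernel and the step(x) environment interface are assumptions carried over from the earlier TD(λ) sketch.

import numpy as np

def gauss_kernel(a, b, h=0.2):
    return np.exp(-np.linalg.norm(np.asarray(a) - np.asarray(b)) ** 2 / (2 * h ** 2))

def ktd_lambda_episode(step, x_start, centers, alphas, lam=0.6, gamma=1.0, eta=0.3, h=0.2):
    """One episode of KTD(lambda); f(x) = sum_k alphas[k] * kappa(centers[k], x)."""
    def value(s):
        return sum(a * gauss_kernel(c, s, h) for c, a in zip(centers, alphas))

    elig = [0.0] * len(alphas)                 # lambda^{n-k} factor attached to each center
    x = x_start
    while True:
        r, x_next, done = step(x)
        elig = [lam * e for e in elig] + [1.0] # phi(x(n)) joins the expansion
        centers.append(x)
        alphas.append(0.0)
        v_next = 0.0 if done else value(x_next)
        td_error = r + gamma * v_next - value(x)
        # each eligible center receives a portion of the TD error
        alphas[:] = [a + eta * td_error * e for a, e in zip(alphas, elig)]
        if done:
            return centers, alphas
        x = x_next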
3.2.3 Convergence of Kernel Temporal Difference(λ)
Based on the convergence guarantees for TD(λ), we are able to extend the result to
the convergence of KTD(λ).
• λ = 1 case
Theorem 3.1 shows that in the case of TD with λ = 1, the solution converges to the same solution as supervised learning (least squares) due to the derivation of the TD update rule based on (3–6). We can also use this relation to show the convergence of KTD(1).
[27] proved the following proposition;
Proposition 3.1. The KLMS algorithm converges asymptotically in the mean sense to
the optimal solution under the “small-step-size” condition [27].
In a multi-step prediction problem, KTD(1) is derived by replacing the error in
supervised learning with the TD error term using (3–6). Thus, we obtain the following
theorem;
Theorem 3.4. On multi-step prediction problems, the KTD(1) procedure produces the same per-sequence weight changes as the least squares solution.
Proof. Since by (3–6) the sequence of TD errors can be replaced by a multistep
prediction with error e(n) = d − y(n), the result of Proposition 3.1 also applies in this
case.
This means that KTD(1) also asymptotically converges to the optimal solution when the stepsize satisfies ∑_n η_n = ∞ and ∑_n η_n² < ∞ for n ≥ 0.
• λ < 1 case
For general λ cases (λ < 1), we saw that the convergence of TD heavily relies on the state representation x(n); the convergence is proved given that the state feature vectors are linearly independent (Theorems 3.2 and 3.3).
Many models can be reformulated using a dual representation, and this idea
naturally arises when using kernel functions. We derived KTD(λ) using the dual
representation to express the solution of TD(λ) in terms of the kernel function. Note
that the weight vector in the RKHS can be expressed as the linear combination of the
feature vectors ϕ(x) (Proposition 3.2).
Therefore, we can extend Theorem 3.2 and the convergence proof to TD(0 < λ < 1)
to KTD(λ < 1) by showing that the feature map ϕ creates a representation of states in
the RKHS satisfying the linear independence assumption when the kernel κ is strictly
positive definite. This implies that the convergence guarantee of TD(λ < 1) can be
extended to KTD(λ < 1) when it is viewed as a linear function approximator in the
RKHS.
Proposition 3.2. If κ : X × X → R is a strictly positive definite kernel, for any finite set
{xi}Ni=1 ⊆ X of distinct elements, the set {ϕ(xi)} is linearly independent.
Proof. If κ is strictly positive definite, then ∑ α_i α_j κ(x_i, x_j) > 0 for any set {x_i} with x_i ≠ x_j, ∀i ≠ j, and any α_i ∈ R such that not all α_i = 0. Suppose there exists a set {x_i} for which {ϕ(x_i)} are not linearly independent. Then, there must be a set of coefficients α_i ∈ R not
all equal to zero such that ∑ α_i ϕ(x_i) = 0, which implies that ∥∑ α_i ϕ(x_i)∥² = 0:
0 = ∑ α_i α_j ⟨ϕ(x_i), ϕ(x_j)⟩ (3–43)
= ∑ α_i α_j κ(x_i, x_j), (3–44)
which contradicts the assumption.
This shows that if a strictly positive definite kernel is used, the condition of linearly
independent state representations is satisfied in KTD(λ). This is a necessary condition
for convergence of TD(0) in Theorem 3.2 and TD(0 < λ < 1) based on the ODE
representation from Theorem 3.3.
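Proposition 3.2 can also be verified numerically: for a strictly positive definite kernel such as the Gaussian, the Gram matrix over a set of distinct points has full rank, which is exactly the linear independence required by Theorems 3.2 and 3.3. The points in the sketch below are arbitrary examples chosen only for this check.

import numpy as np

def gaussian_gram(X, h=0.2):
    """Gram matrix K with K[i, j] = kappa(x_i, x_j) for the Gaussian kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * h ** 2))

X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.25, 0.25, 0.5, 0.0]])
K = gaussian_gram(X)
print(np.linalg.matrix_rank(K))   # 4, i.e. full rank, so {phi(x_i)} are linearly independent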
3.3 Correntropy Temporal Differences
In the previous sections, we focused our attention on the functional mapper of the adaptive system. In the present section, we turn our attention towards the cost
function. A common issue in practical scenarios is that the assumptions about noise or
the model may not hold or are subject to perturbations. Most studies on TD algorithms
showing the performance on synthetic experiments such as the noiseless Markov
chain or random walk problems do not usually address the issue of how noise or small
perturbations to the model affect performance. In practice, noisy state transitions
or rewards may be observed, and noise may even be present in the input state
representations. Highly noise-corrupted environments lead to difficulties in learning,
and this may result in failure to obtain the desired behavior of the controller.
One of the most widely used figures of merit is the mean square error (MSE), which is a second order statistic, and methods such as TD(λ) and KTD(λ) use this criterion. It is well known that the MSE criterion is most useful under Gaussianity assumptions [20], and any departure from this behavior can affect
performance significantly. Correntropy [42] is an alternative to MSE that has been shown
to be able to deal with situations where the Gaussianity does not hold. One of the main
features of correntropy as a cost function is its robustness to large perturbations in the
learning process; performance improvements over MSE in many realistic scenarios
including fat-tail distributions and severe outlier noise have been demonstrated in
[26, 59].
3.3.1 Correntropy
The generalized correlation function called correntropy was first introduced in [42].
Correntropy is defined in terms of inner products of vectors in a kernel feature space
B(X ,Y ) = E [κ(X − Y )] (3–45)
where X and Y are two random variables, and κ is a translation invariant kernel. When
κ is the Gaussian kernel, the Taylor series expansion of correntropy is given by
B(X, Y) = (1/(√(2π) h_c)) ∑_{n=0}^{∞} ((−1)^n / (2^n h_c^{2n} n!)) E[∥X − Y∥^{2n}]. (3–46)
This expansion shows that correntropy includes all the even-order moments of the random variable ∥X − Y∥. A different kernel can lead to a different expansion, but what is noticeable is that by using a nonlinear kernel, correntropy contains information beyond second order statistics of the statistical distribution, and thus it is better suited for non-linear and non-Gaussian signal processing. It has also been observed that in an
impulsive noise environment, correntropy can obtain performance improvements over
the conventional MSE criterion [26]
MSE(X ,Y ) = E [(X − Y )2]. (3–47)
The geometric meaning of correntropy in the sample space can be explained
through the correntropy induced metric (CIM). The correntropy induced metric (CIM) is
defined as follows
CIM(X ,Y ) = (κ(0)− B(X ,Y ))1/2, (3–48)
where the Gaussian kernel κ(x, y) = exp(−∥x − y∥² / (2h_c²)) is used, and the input space vectors are X = (x_1, x_2, · · · , x_N)⊤ and Y = (y_1, y_2, · · · , y_N)⊤.
Figure 3-2. Contours of CIM(X, 0) in the 2-dimensional sample space of errors e1 = y1 − d1 and e2 = y2 − d2, with kernel size h = 0.2.
Figure 3-2 shows the behavior of the CIM based on a sample {(x1, y1), (x2, y2)} of
size N = 2. Here, kernel size hc = 0.2 is applied. Whereas MSE measures the L2-norm
distance between random variables with finite variance, CIM based on the Gaussian
kernel approximates the L2-norm distance only for points that are close, and as points
get further apart the metric goes to a transition phase where it resembles L1-norm
distance and finally approaches the L0-“norm” for points that are far away.
Notice that if only one of the errors is large, CIM does not change as long as the
other error is small. This behavior shows how CIM can effectively deal with outliers.
Furthermore, the kernel bandwidth hc controls the scale of CIM norm; a smaller kernel
size enlarges the region for the L0-“norm”, and a larger kernel size extends the L2-norm
area. Thus, selecting a proper kernel size is necessary.
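For reference, a sample estimate of correntropy (3–45) and the corresponding CIM value (3–48) can be computed as below; the unnormalized Gaussian kernel with κ(0) = 1 is used, following (3–48), and the error samples are illustrative assumptions.

import numpy as np

def correntropy(x, y, hc=0.2):
    """Sample estimate of B(X, Y) = E[kappa(X - Y)] with a Gaussian kernel."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.mean(np.exp(-e ** 2 / (2 * hc ** 2)))

def cim(x, y, hc=0.2):
    """Correntropy induced metric; kappa(0) = 1 for the (unnormalized) Gaussian kernel."""
    return np.sqrt(1.0 - correntropy(x, y, hc))

# e.g., one small and one large error: the large error saturates and barely moves the CIM
print(cim([0.05, 2.0], [0.0, 0.0]))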
3.3.2 Maximum Correntropy Criterion
Correntropy can be used as a cost function, and it has been applied to adaptive
systems [45, 46, 59]. Let κ be the shift invariant kernel employed in correntropy. The cost
function can be written as
J = E[κ(e)] (3–49)
≈ (1/N) ∑_{n=1}^{N} κ(e(n)). (3–50)
For a system described by parametric mapping y = f (x |θ), the parameter set θ can be
adapted such that the correntropy of error signal d − y is maximized. This is called the
maximum correntropy criterion (MCC)
MCC = max_θ ∑_{n=1}^{N} κ(e(n)). (3–51)
The MCC can be understood in the context of M-estimation, which is a generalized
maximum likelihood method to estimate parameters θ under the cost function
min_θ ∑_{n=1}^{N} ρ(e(n)|θ), (3–52)
where ρ is a differentiable function satisfying ρ(e) ≥ 0, ρ(0) = 0, ρ(e) = ρ(−e), and
ρ(e(i)) ≥ ρ(e(j)) for |e(i)| > |e(j)|. This general estimation is equivalent to a weighted
least square problem
min_θ ∑_{n=1}^{N} w(e(n)) e(n)², (3–53)
where w(e) = ρ′(e)/e and ρ′ is the derivative of ρ. When ρ(e) = (1 − exp(−e²/(2h_c²)))/(√(2π) h_c),
the generalized likelihood problem becomes MCC. The relation to the weighted least
squares problem becomes obvious by looking at the gradient of J, for which a Gaussian
weighting term places more emphasis on small errors, and diminishes the effect of large
errors. This property is key to the robustness to outliers or sudden perturbations in the
error. Notice that the kernel size still controls the weights.
3.3.3 Correntropy Temporal Difference
A variant of the least mean square (LMS) algorithm [45] using MCC has been
formulated in supervised learning. Similar to the MSE criterion, a stochastic gradient
ascent approach can be used to maximize correntropy between desired signal d(n) and
the system output y(n). Let G denote the Gaussian kernel employed by correntropy. The
gradient of the cost function is expressed as follows
∇J_n = ∂B(d(n), y(n))/∂w = (∂G(e(n))/∂e(n)) · (∂e(n)/∂w) (3–54)
= (1/h_c²) e(n) G(e(n)) · ∇_w y(n). (3–55)
In addition, in the previously described multi-step prediction problem, temporal
difference (TD) error can be linked to the LMS algorithm by using the recursion (3–6).
Therefore, we can also apply TD with MCC as follows;
w ← w + η ∑_{n=1}^{m} Δw_n (3–56)
= w + η ∑_{n=1}^{m} [e(n) G(e(n)) ∇_w y(n)] (3–57)
= w + η ∑_{n=1}^{m} [∑_{k=n}^{m} e(k) exp(−(∑_{k=n}^{m} e(k))² / (2h_c²)) x(n)], (3–58)
where e(n) = y(n + 1)− y(n) when y(n) is a linear function of x(n).
In the case of λ = 0, we saw that supervised learning algorithms and their extended
TD algorithms have exactly the same form of update rule except for the error terms.
Thus, we can obtain a direct extension for correntropy temporal difference (CTD) as
follows
w ← w + η ∑_{n=1}^{m} [(y(n + 1) − y(n)) exp(−(y(n + 1) − y(n))² / (2h_c²)) x(n)]. (3–59)
Equation (3–59) also satisfies the weight updates in the case of single step prediction
problems (when m = 1).
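A per-step Python sketch of the CTD(0) update (3–59) is given below. As in the value-estimation experiments of the next chapter, the reward-based TD error (3–18) is used in place of the pure prediction difference; this substitution and the parameter values are assumptions of the sketch.

import numpy as np

def ctd_step(w, x, x_next, r, eta=0.3, gamma=1.0, hc=5.0, terminal=False):
    """One correntropy TD(0) update for a linear value function V(x) = w^T x."""
    v_next = 0.0 if terminal else w @ x_next
    e = r + gamma * v_next - w @ x                # TD error
    weight = np.exp(-e ** 2 / (2 * hc ** 2))      # MCC weighting: large errors are attenuated
    return w + eta * weight * e * x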
3.3.4 Correntropy Kernel Temporal Difference
Using the ideas of both kernel least mean square (KLMS) and maximum correntropy
criterion, kernel maximum correntropy (KMC) is introduced in [59]. Again, to maximize
the error signal correntropy, we can use stochastic gradient ascent, and the updates
to the system are based on the positive gradient of the new cost function in the feature
space. Thus, in KMC, the gradient can be expressed as follows;
∇J_n = ∂B(d(n), y(n))/∂f = (∂G(e(n))/∂e(n)) · (∂e(n)/∂f) (3–60)
= (1/h_c²) e(n) G(e(n)) · ϕ(x(n)). (3–61)
Also, the estimated function at time n + 1 can be obtained as
f_0 = 0 (3–62)
f_{n+1} = f_n + η ∇J_n (3–63)
= f_n + η [exp(−e(n)² / (2h_c²)) e(n) ϕ(x(n))] (3–64)
= f_{n−1} + η ∑_{i=n−1}^{n} [exp(−e(i)² / (2h_c²)) e(i) ϕ(x(i))] (3–65)
= η ∑_{i=1}^{n} [exp(−e(i)² / (2h_c²)) e(i) ϕ(x(i))] (3–66)
Again, by using the error relation in supervised and TD learning in (3–6), in the multistep prediction problem, the temporal difference (TD) error can be integrated in KMC as follows
f ← f + η ∑_{n=1}^{m} [exp(−(∑_{k=n}^{m} (y(k + 1) − y(k)))² / (2h_c²)) ∑_{k=n}^{m} (y(k + 1) − y(k)) ϕ(x(n))]. (3–67)
In the case of λ = 0, we saw that the KLMS (3–36) and KTD(0) (3–40) update rules
have exactly the same form except for the error terms. Thus, we can derive correntropy
kernel temporal difference (CKTD) as follows
f ← f + η ∑_{n=1}^{m} [exp(−(y(n + 1) − y(n))² / (2h_c²)) (y(n + 1) − y(n)) ϕ(x(n))]. (3–68)
This equation also satisfies (3–67) in the case of single step predictions (m = 1).
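Similarly, CKTD(0) (3–68) only rescales the coefficient assigned to the newly inserted center by the exponential of the TD error; a per-step sketch, under the same assumptions as the CTD and KTD examples above, is:

import numpy as np

def gauss_kernel(a, b, h=0.2):
    return np.exp(-np.linalg.norm(np.asarray(a) - np.asarray(b)) ** 2 / (2 * h ** 2))

def cktd_step(centers, alphas, x, x_next, r, eta=0.3, gamma=1.0, h=0.2, hc=5.0, terminal=False):
    """One CKTD(0) update; f is a kernel expansion over previously visited states."""
    def value(s):
        return sum(a * gauss_kernel(c, s, h) for c, a in zip(centers, alphas))
    v_next = 0.0 if terminal else value(x_next)
    e = r + gamma * v_next - value(x)             # TD error
    weight = np.exp(-e ** 2 / (2 * hc ** 2))      # correntropy weighting
    centers.append(x)
    alphas.append(eta * weight * e)               # with lambda = 0 only phi(x(n)) is updated
    return centers, alphas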
Note that compared to TD(0) (3–14) and KTD(0) (3–40), the only difference
between the CTD (3–59) and CKTD (3–68) update rules is the extra weighting term
which is the exponential of the error. Therefore, the stability result from Theorem 3.3
should also apply in the case of correntropy since the extra weighting term can be
factor together with the stepsize. This should not change the conditions on the stepsize
sequence, ηn → 0,∑
n ηn = ∞, since we employ the Gaussian kernel for correntropy
satisfying 0 ≤ G(e) ≤ 1. Nevertheless, the convergence points for correntropy TD and
TD will be different in general.
In this chapter, three new temporal difference algorithms were introduced for
state value function estimation. First, an algorithm that combines kernel based
representations with conventional TD learning, kernel temporal difference (KTD)(λ), was
introduced. One of the key advantages of KTD(λ) is its nonlinear function approximation
capability in the input space with convergence guarantees. Because of the linear
structure of the computations that are implicitly carried out in the feature space through
the kernel expansion, existing results on linear function approximation can be extended
to the kernel setting. Following this, the maximum correntropy criterion (MCC) as a
robust alternative to MSE was applied to TD(λ) and KTD(λ) algorithms. We introduced
the correntropy temporal difference (CTD) and correntropy kernel temporal difference (CKTD) algorithms. These algorithms are shown to be stable and robust under noisy conditions. Note that nonlinear function approximation capabilities and robustness are appealing properties for practical implementations. Learning methods with nonlinear function approximation capabilities have been the subject of active research. However, the lack of convergence guarantees has been an issue that makes this avenue less attractive for real applications. A powerful aspect of KTD(λ) is its approximation mechanism, which overcomes this convergence issue.
CHAPTER 4
SIMULATIONS - POLICY EVALUATION
In this chapter, we examine the empirical performance of the temporal difference algorithms introduced in the previous sections on the problem of estimating the state value function ~V given a fixed policy π.
First of all, we carry out experiments on a simple illustrative Markov chain described
in [6]; we refer to this problem as the Boyan chain problem. This is a popular experiment
involving an episodic task to test TD learning algorithms. The experiment is useful in
illustrating linear as well as nonlinear functions of the state representations, and shows
how the state value function is estimated using adaptive systems. TD(λ) and KTD(λ)
are compared in the linear and nonlinear function approximation problem. Furthermore,
TD(λ), KTD(λ), CTD, and CKTD are applied to a noisy environment where the policy
does not remain fixed but is randomly perturbed.
4.1 Linear Case
To test the efficacy of the proposed method, we first observe the performance
on a simple Markov chain (Figure 4-1). There are 13 states numbered from 12 to
0. Each trial starts at state 12 and terminates at state 0. Each state is represented
by a 4-dimensional vector, and the rewards are assigned in such a way that the
value function V is a linear function on the states; namely, V ∗ takes the values
[0,−2,−4, · · · ,−22,−24] at states [0, 1, 2, · · · , 11, 12]. In the case of V = w⊤x , the
optimal weights are w ∗ = [−24,−16,−8, 0].
To assess the performance of the algorithms, the updated estimate of the state
value function ~V (x) is compared to the optimal value function V ∗ at the end of each trial.
This is done by computing the RMS error of the value function over all states
RMS = √((1/n) ∑_{x∈X} (V*(x) − ~V(x))²), (4–1)
where n is the number of states, n = 13.
Figure 4-1. A 13 state Markov chain [6]. For states from 2 to 12, the state transition probability is 0.5 and the corresponding reward is −3. State 1 has state transition probability of 1 to the terminal state 0 and reward −2. States 12, 8, 4, and 0 have 4-dimensional state space representations [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], and [0, 0, 0, 1] respectively, and the representations of the other states are linear interpolations between the above vectors.
We saw in the previous chapter that the stepsize is required to satisfy η(n) ≥ 0, ∑_{n=1}^{∞} η(n) = ∞, and ∑_{n=1}^{∞} η(n)² < ∞ to guarantee convergence. Consequently, the following stepsize scheduling is applied;
η(n) = η_0 (a_0 + 1)/(a_0 + n), n = 1, 2, · · · , (4–2)
where η_0 is the initial stepsize, and a_0 is the annealing factor which controls how fast the
stepsize decreases. In this experiment, a0 = 100 is applied. Furthermore, we assume
that the policy π is guaranteed to terminate, which means that the value function V π is
well-behaved without using a discount factor γ in (2–6); that is, γ = 1.
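To make the setup concrete, the sketch below builds the 4-dimensional state representations of Figure 4-1 by linear interpolation between the anchor states 12, 8, 4, and 0, and implements the stepsize schedule (4–2). The interpolation rule is our reading of the figure caption and should be treated as an assumption; it is at least consistent with the optimal weights w* = [−24, −16, −8, 0] and V* = −2·(state) given above.

import numpy as np

def boyan_features(state):
    """4-D representation of a Boyan chain state (0..12), interpolated between anchors."""
    anchors = [12, 8, 4, 0]
    return np.array([max(0.0, 1.0 - abs(state - a) / 4.0) for a in anchors])

def stepsize(n, eta0=0.1, a0=100):
    """Annealed stepsize schedule (4-2)."""
    return eta0 * (a0 + 1) / (a0 + n)

w_star = np.array([-24.0, -16.0, -8.0, 0.0])
print(boyan_features(10), w_star @ boyan_features(10))   # [0.5 0.5 0. 0.] and -20.0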
Using the above set up, we first apply TD(λ) to estimate the value function
corresponding to the Boyan chain (Figure 4-1). To obtain the optimal parameters,
various combinations of eligibility trace rates λ and initial step size η0 values are
evaluated. Eligibility trace rates λ from 0 to 1 with 0.2 jumps and initial stepsizes η0
between 0.1 and 0.9 with 0.1 intervals are observed for 1000 trials (Figure 4-2). The
RMS error of the value function is averaged over 10 Monte Carlo runs, and the initial
weight vector is set as w = 0 at each run.
Across all values of λ with optimal stepsize, TD(λ) provides good approximation to
V ∗ after 1000 trials. We observe that small stepsizes generally give better performance.
Figure 4-2. Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η0 in TD(λ). The plotted vertical line segments are the mean RMS values after 100 trials (top markers), 500 trials (middle markers), and 1000 trials (bottom markers).
However, if the stepsize is very small, the system fails to reach a good performance
level, especially with small λ values (λ = 0, 0.2, and 0.4). We know that stepsize mainly
controls the trade off between performance accuracy and speed of learning, so small
stepsize learning may be too slow to converge within 1000 trials. Also, large stepsizes
result in larger error due to mis-adjustment. Based on Figure 4-2, parameter values of
λ = 1 and η0 = 0.1 are selected for further observation.
Before we extend the experiment, we want to observe the behavior of KTD(λ) in
a linear function approximation problem. We previously emphasized the capability of KTD(λ) as a nonlinear function approximator; however, with an appropriate kernel size, KTD(λ) should also approximate linear functions well on a region of interest.
In KTD(λ), we employ the Gaussian kernel,
κ(x(i), x(j)) = exp(−∥x(i) − x(j)∥² / (2h²)), (4–3)
which is a universal kernel commonly encountered in practice. To find the optimal kernel
size, we fix all the other free parameters around median values, λ = 0.4 and η0 = 0.5,
and the average RMS error over 10 Monte Carlo runs is compared (Figure 4-3).
For this specific experiment, it seems obvious that smaller kernel sizes yield better
performance, since the state representations are finite. However, in general, applying
Figure 4-3. Performance over different kernel sizes h in KTD(λ). The vertical line segments contain the mean RMS values after 100 trials (top markers), 500 trials (middle markers), and 1000 trials (bottom markers).
too small a kernel size leads to over-fitting or, in this case, to slow learning. In particular, choosing a very small kernel leads to a procedure very similar to a table look up method. Thus, we choose the kernel size h = 0.2, the largest kernel size for which we obtain mean RMS values similar to those for h = 0.1 and h = 0.05 at the 1000th trial, and the lowest mean RMS at the 100th trial.
After fixing the kernel size at h = 0.2, the experimental evaluation of different
combinations of eligibility trace rates λ and initial step sizes η0 are observed. Figure 4-4
shows the average performance over 10 Monte Carlo runs for 1000 trials.
Figure 4-4. Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η0 in KTD(λ) with h = 0.2. The plotted vertical line segments contain the mean RMS value after 100 trials (top marker), 500 trials (middle marker), and 1000 trials (bottom marker).
All λ values with optimal stepsize show good approximation to V* after 1000 trials. Smaller stepsizes with larger λ values show better performance in TD (Figure 4-2), whereas larger stepsizes with smaller λ values perform better in KTD. Notice that KTD(λ = 0) shows slightly better performance than KTD(λ = 1); this may be attributed to the local nature of KTD when using the Gaussian kernel. In addition, varying the stepsize has a relatively small effect on KTD(λ). Again, the Gaussian kernel, as well as other normalized kernels, provides an implicit normalized update rule which is known to be less sensitive to step size. Based on Figure 4-4, the optimal eligibility trace rate and initial stepsize values λ = 0.6 and η0 = 0.3 are selected for KTD with kernel size h = 0.2.
The learning curves of TD(λ) and KTD(λ) are compared. The optimal parameters
are employed in both algorithms based on the experimental evaluation (λ = 1 and
η0 = 0.1 for TD and λ = 0.6 and η0 = 0.3 for KTD), and the RMS error is averaged over
50 Monte Carlo runs for 1000 trials. Comparative learning curves are given in Figure 4-5.
Both algorithms reach a mean RMS value of around 0.06. Here, we confirmed the ability of TD(λ) and KTD(λ) to handle the function approximation problem when the fixed policy yields a state value function that is linear in the state representation. As expected, TD(λ) converges faster to the optimal solution because of the linear nature of the problem. KTD(λ) converges more slowly than TD(λ), but it is also able to approximate the value function properly. In this sense, KTD is open to a wider class of problems than its linear counterpart.
Also, the estimated state values ~V for the last 50 trials are observed in Figure 4-6. It
shows that both TD(λ) and KTD(λ) successfully estimate the optimal state value V ∗.
4.2 Linear Case - Robustness Assessment
In this section, we want to observe the role of the cost function in the adaptation
process. In the following experiment, we consider the same Boyan chain from the
previous linear case (Figure 4-1), but unlike the above case, the rewards are random
Figure 4-5. Learning curves of TD(λ) and KTD(λ). The solid line shows the mean RMS error and the dashed line shows the standard deviations over 50 Monte Carlo runs.
Figure 4-6. The comparison of state value V(x), x ∈ X, convergence between TD(λ) and KTD(λ). The solid line shows the optimal state values V* and the dashed line shows the estimated state values ~V by TD(λ) (left) and KTD(λ) (right).
variables themselves. We refer to them as noisy rewards. Three types of noise will be
added to the original discrete reward values, and the behaviors of TD and correntropy TD will be compared.
First, Gaussian noise with probability density function
G(µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)² / (2σ²)) (4–4)
is added to the rewards. Here, the mean µ is set as zero, and different variance values
(σ2 = 0.2, 0.5, 1, 2, 10, 20, 50) are applied. From Figure 4-2, we observed that the
parameter set λ = 1 and η0 = 0.1 leads to the best performance. However, for fair
comparison with CTD, TD with λ = 0 and η0 = 0.3 will be applied. To observe the
influence of the Gaussian noise in the performance of the TD algorithm, we apply it for
two parameter sets (λ = 1 and η0 = 0.1 (red line) and λ = 0 and η0 = 0.3 (blue line) as
depicted in Figure 4-7). The RMS error is averaged over 50 Monte Carlo runs, and the
plot shows the mean and standard deviation at the 1000th trial.
Figure 4-7. The performance of TD for different levels (variances σ²) of additive Gaussian noise on the rewards, for the parameter sets λ = 1, η0 = 0.1 and λ = 0, η0 = 0.3.
For both parameter sets, increasing the noise variance worsens the performance of TD in terms of the mean and standard deviation of the average RMS errors. Now, we examine how CTD behaves with respect to the noise variance. Recall that correntropy itself requires setting an extra kernel size that is different from the kernel size parameter required in KTD. To distinguish the two kernels, we will refer to the kernel hc as the correntropy kernel. Different correntropy kernel sizes (hc = 1, 2, 3, 4, 5, 10, 20, 50, 100) are applied to the CTD algorithm, and its performance is observed with respect to the noise variances σ² = 0.2, 1, 10, 50 (Figure 4-8).
Smaller correntropy kernel sizes hc yield higher RMS error, but as the correntropy kernel size gets larger, the error converges to values similar to those obtained with TD for λ = 0 and η0 = 0.3 (the blue line in Figure 4-7). This result is intuitive since MSE is optimal under Gaussian assumptions, and for a large enough kernel size correntropy behaves similarly to MSE. We further look into the learning behavior of TD and CTD when the noise variance σ² = 10 is taken (Figure 4-9). The RMS error is averaged over 50 Monte Carlo runs, and the mean and standard deviation at the 1000th trial are displayed.
As shown in Figures 4-7 and 4-8C, similar means and variances for TD and CTD can be observed. Nevertheless, CTD shows smoother learning curves than TD. This is expected since correntropy behaves similarly to MSE when hc → ∞. From the comparison between the two different correntropy kernel sizes, hc = 5 and hc = 10, we observe that the smaller correntropy kernel size has a slower convergence rate. In this experiment, we confirmed that the MSE criterion is optimal in the case of Gaussian noise with zero mean, and that CTD is also able to approximate the value function with a proper choice of kernel size hc. Note that Gaussian noise with different variances is added to the assigned reward in Figure 4-8.
Secondly, we explore the behavior of TD and CTD under outlier noise conditions;
the mixture of Gaussian distributions, 0.9G(0, 1) + 0.1G(5, 1), is added to the reward
values. From Figure 4-2, we know that for TD(λ = 0), the initial stepsize η0 = 0.3 is
optimal. To find the optimal correntropy kernel size hc in CTD, we evaluate different
correntropy kernel sizes, hc = 1, 2, 2.5, 3, 5, 10, 20, 50, 100 (Figure 4-10).
Small correntropy kernel sizes, hc = 1 and hc = 2, lead to large RMS error. In this case, convergence can be very slow, since only in a small vicinity of the optimal solution does the gradient take values that make the adaptive system respond accordingly. When hc = 2.5, correntropy TD shows the lowest RMS error, and as hc increases, the average RMS increases and converges to results similar to those of TD. Again, it is obvious that a large correntropy kernel size takes into account larger values of the
Figure 4-8. The performance change of CTD over different correntropy kernel sizes hc: A) σ² = 0.2, B) σ² = 1, C) σ² = 10, D) σ² = 50.
Figure 4-9. Learning curves of TD and CTD (hc = 5 and hc = 10) when Gaussian noise with variance σ² = 10 is added to the reward. The RMS error is averaged over 50 Monte Carlo runs; the solid line shows the mean RMS error, and the dashed line represents the standard deviations.
error, especially those of the second component of the mixture. The learning curves of
TD and CTD are compared in Figure 4-11.
Figure 4-10. Performance of CTD corresponding to different correntropy kernel sizes hc, with a mixture of Gaussians noise distribution. The RMS error is averaged over 50 Monte Carlo runs, and the plot shows the mean and standard deviation at the 1000th trial.
Figure 4-11. Learning curves of TD and CTD (hc = 2.5, 5, and 100) when the noise added to the rewards corresponds to a mixture of Gaussians. The RMS error is averaged over 50 Monte Carlo runs; the solid line is the mean RMS error, and the dashed line shows the standard deviation.
We can observe that as the correntropy kernel size increases, the performance becomes similar to that of TD. Even though CTD with the optimal correntropy kernel size hc initially converges more slowly than TD, the error keeps decreasing below the values obtained with TD. This is a clear example of the robustness of correntropy to non-Gaussian, non-symmetric impulsive noise.
It is well known that "heavy tail" distributions such as the Laplacian make MSE non-optimal [45, 47]. Thus, our third experiment considers Laplacian distributed additive noise
L(µ, 2b²) = (1/(2b)) exp(−|x − µ| / b) (4–5)
in the assigned reward. The mean µ is set as zero, and different variances (b2 =
0.04, 0.25, 1, 4, 25, 100) are applied. Again, TD with λ = 1 and η0 = 0.1 and λ = 0 and
η0 = 0.3 is applied to observe the influence of the Laplacian noise (Figure 4-12). In both
cases, the performance degrades as the variance increases. Moreover, the RMS values
obtained for Gaussian distributed noise with similar variances are smaller, which goes
along with the fact that MSE is suboptimal in this case.
Figure 4-12. Performance changes of TD with respect to different Laplacian noise variances b², for the parameter sets λ = 1, η0 = 0.1 and λ = 0, η0 = 0.3. The RMS error is averaged over 50 Monte Carlo runs, and the plot shows the mean and standard deviation at the 1000th trial.
The performance of CTD is observed for different noise variances b2 = 0.04, 1, 25, 100.
The correntropy kernel sizes of hc = 1, 2, 3, 4, 5, 10, 20, 50, 100 are applied. Figure 4-13
shows the corresponding RMS values.
When the noise variance is small (b2 = 0.04 and b2 = 1), the results show
that performance does not degrade as the correntropy kernel size becomes larger
(Figure 4-13A and 4-13B). In this case, the two cost functions, MSE and MCC, do not show significant differences. However, when the noise variance is large (b² = 25
Figure 4-13. Performance of CTD depending on different correntropy kernel sizes hc with various Laplacian noise variances: A) b² = 0.04, B) b² = 1, C) b² = 25, D) b² = 100. The RMS error is averaged over 50 Monte Carlo runs, and the plot shows the mean and standard deviation at the 1000th trial.
and b2 = 100), certain correntropy kernel sizes show smaller error than other larger
correntropy kernel sizes (Figure 4-13C and 4-13D). Since MSE is not optimal under
’heavy tail’ noise distributions, approximating the behavior of correntropy to MSE
by increasing the kernel size results in worse performance. Figure 4-14 shows the learning curves of TD and CTD when Laplacian noise with variance b² = 25 is added to the reward.
At the beginning, TD shows a slightly faster convergence rate, but after around
the 50th trial, CTD reaches lower RMS error than TD. Again, this example verifies the
robustness of correntropy in non-Gaussian scenarios, in particular, heavy-tailed (sparse) noise distributions.
Figure 4-14. Learning curves of TD and CTD (hc = 10) when Laplacian noise with variance b² = 25 is added to the reward. The RMS error is averaged over 50 Monte Carlo runs; the solid line is the mean RMS error, and the dashed line shows the standard deviations.
4.3 Nonlinear Case
We have seen the performances of TD(λ), KTD(λ), and CTD on the problem
of estimating a state value function, which is a linear function of the given state
representation. Now, the same problem can be turned into a nonlinear one by modifying
the reward values in the chain such that the resulting state value function V ∗ is no longer
a linear function of the states.
The number of states and the state representations remain the same as in the
previous section. However, the optimal value function V ∗ becomes nonlinear with
respect to the representation of the states; namely, (V ∗ = [0− 0.2− 0.6− 1.4− 3− 6.2−
12.6− 13.4− 13.5− 14.45− 15.975− 19.2125− 25.5938]) for states 0 to 12. This implies
that the values of rewards for each state are also different from the ones given for the
linear case (Figure 4-15).
Again, to evaluate the performance, after each trial is completed, the estimated
state value ~V is compared to the optimal state value V ∗ using RMS error (4–1) as
described above for the linear case.
Figure 4-15. A 13 state Markov chain. In states from 2 to 12, each state transition has probability 0.5, and state 1 has transition probability 1 to the absorbing state 0. Note that the optimal state value function can be represented as a nonlinear function of the states, and corresponding reward values are assigned to each state.
First of all, TD(λ) is applied to estimate the value function with various combinations
of λ and initial step size η0. λ from 0 to 1 with 0.2 intervals and initial stepsizes η0
between 0.1 and 0.9 with 0.1 intervals are observed for 1000 trials. The RMS error of the
value function is the average over 10 Monte Carlo runs, and the initial weight vector is
set as w = 0 at each run (Figure 4-16).
Figure 4-16. The effect of λ and the initial step size η0 in TD(λ). The plotted line segments contain the mean RMS value after 100 trials (top marker), 500 trials (middle marker), and 1000 trials (bottom marker).
It is noticeable that larger λ values show better performance, as we know this case corresponds to the least mean squares solution. The behavior for intermediate cases (λ < 1) is not guaranteed to converge to the optimal solution since the representations of all states do not form a linearly independent set of vectors. However, the solution for λ = 1 will still try to approximate E[d|x] because of the implicit regularization in the
stochastic gradient algorithm. For further observation, TD with λ = 0.8 and η0 = 0.1 will
be applied.
For KTD(λ), the Gaussian kernel (4–3) is applied, and kernel size h = 0.2 is chosen
based on Figure 4-17; after fixing all the other free parameters around median values
λ = 0.4 and η0 = 0.5, the average RMS error for 10 Monte Carlo runs is compared.
Then, performances with different combinations of parameters (λ and η0) are compared
with h = 0.2. Figure 4-18 shows the average RMS error over 10 Monte Carlo runs for
1000 trials.
Figure 4-17. The performance of KTD with different kernel sizes h. The plotted line segments contain the mean RMS value after 100 trials (top marker), 500 trials (middle marker), and 1000 trials (bottom marker).
Figure 4-18. Performance comparison over different combinations of λ and the initial stepsize η0 in KTD(λ) with h = 0.2. The plotted segments show the mean RMS value after 100 trials (top segment), 500 trials (middle segment), and 1000 trials (bottom segment).
Again, compared to TD, larger stepsizes with smaller λ values perform better in KTD. The combination of λ = 0.4 and η0 = 0.3 shows the best performance, but the λ = 0 case also shows good performance. Unlike TD, we can say that there is no dominant value for λ. Recall that it has been proved that convergence to E[d|x] is guaranteed for linearly independent representations of the states, which is automatically fulfilled in KTD when the kernel is universal. Therefore, the differences are rather due to the convergence speed controlled by the interaction between the step size and the eligibility trace. Based on Figures 4-16 and 4-18, optimal step size and eligibility trace rate values are selected (λ = 0.8 and η0 = 0.1 for TD and λ = 0.4 and η0 = 0.3 for KTD), and their respective average RMS errors over 50 Monte Carlo runs are shown in Figure 4-19.
Figure 4-19. Learning curves of TD(λ) and KTD(λ). The solid line shows the mean RMS error, and the dashed line represents the standard deviation over 50 Monte Carlo runs.
The linear function approximator, TD(λ) (blue line), cannot estimate the optimal
state values, but KTD(λ) outperforms the linear algorithm; this behavior is expected
since the Gaussian kernel is universal. KTD(λ) reaches a mean RMS value of around 0.07, whereas the mean value for TD(λ) is around 1.8. Figure 4-20 shows the optimal state
value V ∗, and the predicted state value ~V by TD(λ) and KTD(λ) for the last 50 trials.
Notice that TD(λ) tries to estimate the value function by a piece-wise evenly spaced
pattern. This is associated with the degrees of freedom of the representation space
(4-dimensional for the present case). In contrast, KTD(λ) is able to faithfully reproduce
the nonlinear behavior of the value function.
Figure 4-20. The comparison of state value convergence between TD(λ) and KTD(λ). The solid line shows the optimal state values V* and the dashed line shows the estimated state values ~V by TD(λ) (left) and KTD(λ) (right).
4.4 Nonlinear Case - Robustness Assessment
In this section, we extend the experiment to observe the performances of KTD(λ)
and CKTD under noisy rewards. We will consider the same Boyan chain from the
previous nonlinear case (Figure 4-15), and a noisy reward or perturbed policy will be
employed.
First of all, we add impulsive noise with a probability density function given by 0.95G(0, 0.05) + 0.05G(0, 5) to the current reward with probability 0.05. This can be thought of as randomly replacing the policy with probability 0.05. Since the state representations
and the optimal state values are the same as with the previous experiment, based on
Figure 4-17 and 4-18, a Gaussian kernel with kernel size h = 0.2 and initial stepsize
η0 = 0.3 with annealing factor a0 = 100 are applied. For fair comparison with correntropy
KTD, λ is set as 0. We validate the optimal correntropy kernel size based on Figure
4-21. We fix all the other free parameters around median values, λ = 0.4 and η0 = 0.5,
and the average RMS errors over 10 Monte Carlo runs are compared.
Figure 4-21. Performances of CKTD depending on the different correntropy kernel sizes. The plotted line segments contain the mean RMS values after 100 trials (top markers), 500 trials (middle markers), and 1000 trials (bottom markers). Note the log scales on the x and y axes.
We can observe that as the correntropy kernel size gets larger, the mean RMS error at the 100th trial decreases, and we know the solution converges to the same solution as KTD. However, after 1000 trials, CKTD with hc = 5 shows the lowest mean RMS error. This implies
that a larger correntropy kernel size brings faster initial convergence speeds, but it fails
to reach lower errors if the correntropy kernel size remains too large. This motivates
the idea that by controlling the correntropy kernel size during the adaptation, we may
obtain fast and robust function approximation. In Figure 4-22, we compare the learning
curves of KTD and CKTD with different correntropy kernel sizes, and Figure 4-23 shows
the estimated state value ~V by KTD and CKTD. In the case of KTD, it is noticeable that
when the undesirable noisy transition occurs the estimation process degrades, and
thus the overall performance is affected. On the other hand, CKTD shows more stable
performance even with the impulsive noise. For CKTD, we apply a fixed correntropy kernel size hc = 5 (blue line); as expected, it shows a slower convergence rate than KTD (red line) but lower error values. To obtain faster convergence, we start with hc = 150, and the correntropy kernel size is switched to hc = 5 at the 100th trial. In this way, we can accelerate the initial convergence rate and, after switching, reach lower error values. A similar
switching scheme has already been utilized in correntropy based adaptive filtering algorithms, but instead of applying a large initial kernel size, the algorithm uses MSE at the initial stage and then switches to correntropy.
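The switching schedule described above can be expressed as a simple rule; the trial threshold and kernel sizes mirror the experiment, while the helper itself is only a sketch.

def correntropy_kernel_size(trial, hc_initial=150.0, hc_final=5.0, switch_trial=100):
    """Large hc early (close to MSE, fast convergence), small hc later (robust to outliers)."""
    return hc_initial if trial < switch_trial else hc_final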
Figure 4-22. Learning curves of KTD and CKTD (hc = 5 and changing hc). The solid line is the mean RMS error over 50 Monte Carlo runs and the dashed line shows the standard deviation.
Figure 4-23. The comparison of the state value function ~V estimated by KTD, CKTD with hc = 5, and CKTD with changing hc.
Now, we want to see how perturbed policy influences the performance of KTD
and CKTD. We consider the same Boyan chain from the previous nonlinear case
(Figure 4-15), but the observations come from a policy that has been perturbed. Note
that in this experiment, the reward does not contain any noise; noise is only added to
the state transitions. At each step, with probability 0.1, a transition is made from
state i to a state j drawn uniformly from j = 0, ... , 12, and the reward value corresponding to the state xi is
assigned. Since the state representations and the optimal state values are the same
as the previous experiment, a Gaussian kernel with kernel size h = 0.2, initial stepsize
η0 = 0.3 with annealing factor a0 = 100, and λ = 0 are applied. KTD and CKTD are
trained for 2000 trials and the RMS error is averaged over 100 Monte Carlo runs. Figure
4-24 shows the mean and standard deviation of the error norm over 100 Monte Carlo runs at the
2000th trial when CKTD is used with correntropy kernel sizes hc between 1 and 10 with
increments of 1 and also for hc = 15, 20, 100. Figure 4-25 shows the learning curves
of KTD and CKTD with correntropy kernel sizes hc = 2, 3, 5, 6, 7, 100 in terms of mean
RMS error over 100 Monte Carlo runs.
Again, we observe that a larger correntropy kernel size brings faster initial
convergence speeds, but it fails to reach lower errors if the correntropy kernel size
remains too large, showing similar performance to KTD. This experiment provides
evidence that by controlling the kernel size, we can obtain fast and robust function
approximation. Figures 4-24 and 4-25 show that the correntropy kernel size hc = 100
performs the same as KTD; in the case of KTD, the mean and standard deviation at
the last trial are 1.7016 and 0.4069, respectively. Although hc = 5 and hc = 6 yield
error values in the same interval at the last trial, since a larger kernel size converges faster,
hc = 6 is selected as the optimal correntropy kernel size. In the case of KTD, we observe
that when the undesirable noisy transition occurs the estimation process degrades, and
thus the overall performance is affected. On the other hand, CKTD shows more stable
performance even under the random state transitions.
In this chapter, we examine the behavior of the algorithms introduced in the
previous chapter. We present experimental results on synthetic examples to approximate
the state value function under a fixed policy. In particular, we apply the algorithms
to absorbing Markov chains. We observe that KTD(λ) performs well on both linear
and nonlinear function approximation problems. In addition, we show how the linear
Figure 4-24. Mean and standard deviation of RMS error over 100 runs at the 2000th trial. Note the log scale on the x-axis.
Figure 4-25. Mean RMS error over 100 runs. Notice this is a log plot on the horizontal axis.
independence of the input state representations can affect the performance of the
algorithms; this linear independence is an essential guarantee for the convergence of TD with eligibility
traces. The use of strictly positive definite kernels in KTD(λ) implies the linear independence
condition, and thus this algorithm converges for all λ ∈ [0, 1]. Moreover, we perform
experiments with the maximum correntropy criterion under noisy conditions. Experiments
with heavy tail distributions on noisy rewards and state transition probabilities show that
the CTD and CKTD algorithms can improve performance over the conventional MSE criterion. In
particular, the robust behavior of correntropy is tested for Laplacian noise and impulsive
noise that represent the effects of outliers in the reward. Correntropy was also
tested when the policy is randomly replaced; this was achieved by adding a random
perturbation to the state transitions. In the following chapters, we extend the TD
algorithms to estimate the action value function, which can be applied in finding a
proper control policy.
CHAPTER 5
POLICY IMPROVEMENT
We have shown how the kernel based nonlinear mapping and TD(λ) can be
combined in kernel based least mean squares temporal difference learning with
eligibility traces, called KTD(λ), and we have seen the advantages of KTD(λ) in nonlinear
function approximation problems. Moreover, a new robust cost function based on
correntropy has been integrated into the TD and KTD algorithms.
So far, we have only used TD learning algorithms to estimate the state value
function given a fixed policy. However, this is still an intermediate step in RL. Recall that
we want to find a proper state to action mapping that results in maximum return. Since
the value function quantifies the relative desirability of different states in the state space,
it allows comparisons between policies and thus guides the optimal policy search.
Therefore, we can extend the proposed methods to solve complete RL problems.
Here, our goal is to find the optimal control action A(n) at each time n which
maximizes the cumulative reward. When the optimal state value function V ∗ is obtained
(2–12), an optimal policy π∗ can be derived using the state value function; the optimal
action sequence {A(n)} is given by
A(n) = \arg\max_{a \in \mathcal{A}(x)} \sum_{x'} P^{a}_{xx'} \left[ R^{a}_{xx'} + \gamma V^{*}(x') \right]. (5–1)
Here, for the sake of simplicity, we denote x(n) by x. However, direct use of (5–1) is
still limited because, in practice, P^{a}_{xx'} or R^{a}_{xx'} are unknown most of the time. One way to
get around these issues is with Q-learning [55]. Q-learning allows the estimation of the
optimal value function Q∗(x , a) incrementally, and based on the estimated Q, a proper
policy can be obtained.
From the definition of state-action value functions, we have the following relation:
V^{*}(x) = \max_{a \in \mathcal{A}(x)} Q^{\pi^{*}}(x, a). This shows that the optimal action process (5–1) can be
obtained using the action value function Q:

A(n) = \arg\max_{a \in \mathcal{A}(x)} Q^{*}(X(n), a), (5–2)
where {X (n)} is the controlled Markov chain [32].
5.1 State-Action-Reward-State-Action
The first step to apply Q-learning is to estimate the state-action value function Q
instead of the state value function V. State-Action-Reward-State-Action (SARSA) is
introduced to learn the state-action value function Q given a fixed policy. The update
rule of SARSA is as follows,

Q(x(n), a(n)) \leftarrow Q(x(n), a(n)) + \eta \left[ r(n+1) + \gamma Q(x(n+1), a(n+1)) - Q(x(n), a(n)) \right]. (5–3)
This shows that to complete one update, the sequence of state-action pair (x(n), a(n)),
corresponding reward r(n+1), and transition to next state-action pair (x(n+1), a(n+1))
are required, and thus the name SARSA.
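As an illustration, a minimal tabular sketch of the SARSA update (5–3) could look as follows, assuming Q is a dictionary indexed by (state, action) pairs; the names are illustrative:

```python
def sarsa_update(Q, x, a, r_next, x_next, a_next, eta=0.1, gamma=0.9):
    """One SARSA update, Eq. (5-3): Q(x,a) <- Q(x,a) + eta*[r + gamma*Q(x',a') - Q(x,a)]."""
    td_error = r_next + gamma * Q[(x_next, a_next)] - Q[(x, a)]
    Q[(x, a)] += eta * td_error
    return Q
```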
SARSA has a strong relation with Q-learning. It can be understood as Q-learning
[55] given a fixed policy [52]. Q-learning does not use a fixed policy, but it explores
different policies to ultimately obtain a good policy. For large state X and action A
spaces, we can estimate the Q values using function approximators, but now the
proposed TD(λ) algorithms are applied to state action pairs rather than only to states
[48]. This gives the basic idea of how the TD algorithms can be associated with Q
function approximation in policy evaluation.
5.2 Q-learning
Q-learning is a well known off-policy TD control algorithm. The form of the
state-action mapping function (policy) is undetermined, and TD learning is applied
to estimate the state-action value function. This allows the system to explore policies
towards finding an optimal policy. This is an important feature for practical applications
since prior information about a policy is usually not available.
Since value functions represent the expected cumulative reward given a policy,
we can say that the policy π is better than the policy π′ when the policy π gives greater
expected return than the policy π'. In other words, π ≥ π' if and only if Q^{\pi}(x, a) ≥ Q^{\pi'}(x, a)
for all x ∈ X and a ∈ A. Therefore, the optimal action value function Q can be
written as follows,

Q^{*}(x(n), a(n)) = \max_{\pi} Q^{\pi}(x(n), a(n)) (5–4)
= E\left[ r(n+1) + \gamma \max_{a(n+1)} Q^{*}(x(n+1), a(n+1)) \,\middle|\, x(n), a(n) \right] (5–5)
The equation (5–5) can be estimated online, and a one-step Q-learning update can be
defined as,
Q(x(n), a(n)) \leftarrow Q(x(n), a(n)) + \eta \left[ r(n+1) + \gamma \max_{a} Q(x(n+1), a) - Q(x(n), a(n)) \right], (5–6)
to maximize the expected reward E [r(n + 1)| x(n), a(n), x(n + 1)]. At time n, an action
a(n) can be selected using methods such as ϵ-greedy or the Boltzmann distribution,
which are commonly applied [53].
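A minimal sketch of the one-step Q-learning update (5–6) together with ϵ-greedy action selection, assuming a tabular Q (for example a collections.defaultdict(float)) and illustrative names, could look as follows:

```python
import random

def epsilon_greedy(Q, x, actions, eps=0.1):
    """epsilon-greedy selection: a random action with probability eps,
    otherwise the action with the highest current Q value."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(x, a)])

def q_learning_update(Q, x, a, r_next, x_next, actions, eta=0.1, gamma=0.9):
    """One-step Q-learning update, Eq. (5-6)."""
    td_error = r_next + gamma * max(Q[(x_next, b)] for b in actions) - Q[(x, a)]
    Q[(x, a)] += eta * td_error
    return Q
```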
In the case that the state X and action A sets are finite, (5–6) allows explicitly
computing the action value function Q. However, when the state set X and action set A are
infinite or very large, it is infeasible to obtain explicit Q values. Thus, we will see how
function approximation can be integrated into Q-learning.
5.3 Q-learning via Kernel Temporal Differences and Correntropy Variants
We have seen how temporal difference algorithms approximate the state value
functions using a parametrized family of functions. In Q-learning, the state-action
value function Q can be approximated using the proposed methods (KTD(λ), CTD, and
CKTD).
This can be done using the same methods employed for the state value function
estimation. We previously approximated the state value function using a parametrized
family of functions such as ~V (x(n)) = f (x(n),w) using TD algorithms. We can apply the
same approach to approximate the state-action value function:
~Q(x , a = i) = f (x ,w |a = i). (5–7)
In the case of the linear function approximators (TD(λ) and CTD), the action value
function can be estimated as ~Q(x(n), a = i) = w⊤x(n), and for their kernel extensions
(KTD(λ) and CKTD), the action value function can be approximated as ~Q(x(n), a = i) =
⟨f, ϕ(x(n))⟩. Note that ~Q(x(n), a = i) denotes a state-action value given a state x(n) at
time n and a discrete action i.
Therefore, based on Q-learning (5–6), the update rule for KTD(λ) (3–39) can be
integrated as
f \leftarrow f + \eta \sum_{n=1}^{m} \left[ r(n+1) + \gamma \max_{a} Q(x(n+1), a) - Q(x(n), a(n)) \right] \sum_{k=1}^{n} \lambda^{n-k} \phi(x(k)). (5–8)
We call this approach Q-learning via kernel temporal difference (Q-KTD)(λ). For
single-step prediction problems (m = 1), (5–8) yields single updates for Q-KTD(λ) of the
form
Q_{i}(x(n)) = \eta \sum_{j=1}^{n-1} e_{TD_{i}}(j)\, I_{k}(j)\, \kappa\langle x(n), x(j) \rangle. (5–9)
Here, Q_{i}(x(n)) = Q(x(n), a = i), and e_{TD_{i}}(n) denotes the TD error defined as

e_{TD_{i}}(n) = r_{i} + \gamma Q_{i}(x(n+1)) - Q_{i}(x(n)), (5–10)
and Ik(n) is an indicator vector with the same size as the number of outputs (actions).
This means that only the k th entry of the vector is set to 1 and the rest of the entries
are 0. The selection of the action unit k at time n can be based on a greedy method.
Therefore, only the weight (parameter vector) corresponding to the winning action gets
updated. Recall that the reward ri corresponds to the action selected by the current
policy with input x(n) because it is assumed that this action causes the next input state
x(n + 1).
The selection of the action unit k at time n is based on methods such as ϵ-greedy
and the Boltzmann distribution which are commonly applied for the action selection [53].
We adopt ϵ-greedy for our experiments. This is one of the most popular methods to
control the exploration and exploitation trade off. The action corresponding to the unit
with the highest Q value gets selected with probability 1− ϵ. Otherwise, any other action
is selected at random. In other words, the probability of selecting a random action is ϵ.
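To make the structure of the single-step update (5–9) concrete, the sketch below outlines one possible implementation of Q-KTD(0) with a Gaussian kernel and ϵ-greedy action selection. The class and method names are illustrative, and details such as the handling of terminal steps are assumptions rather than part of the original description:

```python
import numpy as np

class QKTD0:
    """Minimal sketch of Q-learning via KTD(0), Eq. (5-9): a Gaussian-kernel expansion
    where only the unit of the selected (winning) action receives a nonzero coefficient."""

    def __init__(self, n_actions, kernel_size=0.2, eta=0.3, gamma=0.9, eps=0.01):
        self.n_actions, self.h, self.eta, self.gamma, self.eps = n_actions, kernel_size, eta, gamma, eps
        self.centers = []   # stored input states x(j)
        self.coeffs = []    # per-center coefficient vectors (zero except for the winning action)

    def _kernel(self, x, c):
        return np.exp(-np.sum((x - c) ** 2) / (2.0 * self.h ** 2))

    def q_values(self, x):
        q = np.zeros(self.n_actions)
        for c, alpha in zip(self.centers, self.coeffs):
            q += alpha * self._kernel(x, c)
        return q

    def select_action(self, x, rng):
        if rng.random() < self.eps:                    # exploration
            return int(rng.integers(self.n_actions))
        return int(np.argmax(self.q_values(x)))        # exploitation

    def update(self, x, a, r_next, x_next, terminal):
        """Add one unit centered at x; only the entry of the selected action is nonzero."""
        q_next = 0.0 if terminal else np.max(self.q_values(x_next))
        td_error = r_next + self.gamma * q_next - self.q_values(x)[a]
        alpha = np.zeros(self.n_actions)
        alpha[a] = self.eta * td_error                 # indicator vector I_k(n) times eta * e_TD
        self.centers.append(np.asarray(x, dtype=float))
        self.coeffs.append(alpha)
```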
The structure of Q-learning based on KTD(0) is shown in Figure 5-1. The number of units (kernel evaluations) increases as more training data arrives. Each added unit is centered at one of the previous input locations x(1), x(2), · · · , x(n − 1).
Figure 5-1. The structure of Q-learning via kernel temporal difference (λ)
Likewise, Q-learning via correntropy temporal difference (Q-CTD) has the following
update rule

w \leftarrow w + \eta \sum_{n=1}^{m} \exp\left( \frac{-e_{TD}(n)^{2}}{2h_{c}^{2}} \right) e_{TD}(n)\, x(n), (5–11)

and Q-CKTD is updated as

f \leftarrow f + \eta \sum_{n=1}^{m} \exp\left( \frac{-e_{TD}(n)^{2}}{2h_{c}^{2}} \right) e_{TD}(n)\, \phi(x(n)). (5–12)
Here, the temporal difference error eTD is defined as
e_{TD}(n) = r(n+1) + \gamma \max_{a} Q(x(n+1), a) - Q(x(n), a(n)). (5–13)
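Relative to the Q-KTD sketch above, the only change introduced by the correntropy cost is that the coefficient assigned to each new unit is weighted by the Gaussian factor in (5–11) and (5–12). A minimal illustrative helper, assuming the same growing-dictionary structure, could be:

```python
import numpy as np

def cktd_coefficient(td_error, eta, hc):
    """Correntropy-weighted coefficient for Q-CKTD, Eq. (5-12): large TD errors
    (outliers) are attenuated by the Gaussian factor exp(-e^2 / (2*hc^2))."""
    return eta * np.exp(-td_error ** 2 / (2.0 * hc ** 2)) * td_error
```

In the earlier sketch, the line `alpha[a] = self.eta * td_error` would simply be replaced by `alpha[a] = cktd_coefficient(td_error, self.eta, hc)`.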
5.4 Reinforcement Learning Brain Machine Interface Based on Q-learning with Function Approximation
We have seen how the agent and environment interact in the reinforcement learning
paradigm in Figure 2-1. Moreover, in Figure 1-1, we have shown how the environment
can be conceived of within the reinforcement learning brain machine interface (RLBMI)
paradigm. The TD algorithms we proposed help model the agent. Figure 3-1 shows how
state value function V can be estimated using the proposed TD algorithms under a fixed
policy. Note that the state value function approximation is only an intermediate step and
the form of policy is fixed.
In RLBMI, it is essential to find the policy which conveys the desired action on the
external device. Direct computation of the optimal policy is challenging since not all the
information required to calculate the optimal policy is known in practice. Therefore,
we estimate the optimal policy using the action value function Q. Figure 5-2 depicts the
RLBMI structure using Q-learning with the proposed TD algorithms.
Figure 5-2. The decoding structure of the reinforcement learning model in a brain machine interface using a Q-learning based function approximation algorithm.
Based on the neural state from the environment, the action value function Q can be
approximated using an adaptive system. We proposed algorithms focusing on both
the functional mapping and the cost function. Kernel based representations have been
integrated to improve the functional mapping capabilities of the system, and correntropy
has been employed as the cost function to obtain robustness in the system. Based on
the estimated Q values, a policy decides a proper action. Note that the policy is the
learning policy which changes over time.
Recall that the main advantage of RLBMI is the co-adaptation between two
intelligent systems: the BMI decoder in the agent, and the BMI user in the environment.
Both systems learn how to earn rewards based on their joint behavior. The BMI decoder
learns a control strategy based on the user's neural state and performs actions in
goal directed tasks that update the state of the external device in the environment. In
addition, the user learns the task based on the state of the external device. Both the BMI
decoder and the user receive feedback after each movement is completed and use this
feedback to adapt. Notice that both systems act symbiotically by sharing the external
device to complete their tasks, and this co-adaptation allows for continuous synergistic
adjustments of the BMI decoder and the user even in changing environments. In
Chapter 7, we will examine how this co-adaptation process works in practice by showing
experiments on real BMIs.
CHAPTER 6
SIMULATIONS - POLICY IMPROVEMENT
In this chapter, we examine the empirical performance of the temporal
difference algorithms extended to the problem of finding a proper state to action mapping based
on the estimated action value function Q. In the following, we not only assess their
performance and behavior but also examine the methods' applicability to practical
situations. Note that in the following simulations, the block diagram of the agent remains
the same as in Figure 5-2; nonetheless, the components in the environment block are
indeed different. For instance, in the following mountain car problem, the states are
position and velocity, and the actions are the left and right accelerations as well as
coast.
6.1 Mountain Car Task
We first carry out experiments on a simple dynamic system which was first
introduced in [34]. This experiment is well known as “Mountain-car task,” a famous
episodic task in control problems. There is a car driving along a mountain track as
depicted in Figure 6-1, and the goal of this task is to reach the top of the right side
hill. The challenge in this task is that there are regions near the center of the hill
where maximum acceleration of the car is not enough to overcome the force imposed
by gravity, and therefore a more sophisticated strategy that allows the car to gain
momentum using the hill must be learned. Thus, if the system simply tries to maximize
short term rewards, it would fail to reach the goal. In this case, the only way to reach the
goal is to first accelerate backwards, even though it is further away from the goal, and
then drive forward with full acceleration. This is a representative example to evaluate the
system’s capability to find a proper policy to achieve a goal in RL.
The details of the model are based on [48]. The observed states correspond to a
pair of continuous variables: the position p(n) and velocity v(n) of the car. The
values are restricted to the intervals −1.2 ≤ p(n) ≤ 0.5 and −0.07 ≤ v(n) ≤ 0.07 for all
Figure 6-1. The Mountain-car task.
time n. The mountain altitude is sin(3p), and the state evolution dynamics are given by
v(n + 1) = v(n) + 0.001a(n)− g cos(3p(n)) (6–1)
p(n + 1) = p(n) + v(n + 1) (6–2)
where g represents gravity (g = 0.0025), and a(n) is a chosen action at time n. There
are 3 possible actions: accelerate backwards a = −1, coast a = 0, and accelerate
forward a = +1. At each time step, reward r = −1 is assigned, and once the updated
position p(n + 1) exceeds 0.5, the trial terminates.
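A minimal sketch of one transition of these dynamics is given below; the function name and return convention are illustrative, and enforcing the stated position and velocity restrictions by clipping is an assumption:

```python
import numpy as np

def mountain_car_step(p, v, a, g=0.0025):
    """One transition of the mountain-car dynamics, Eqs. (6-1)-(6-2).
    a is -1 (accelerate backward), 0 (coast), or +1 (accelerate forward)."""
    v_next = v + 0.001 * a - g * np.cos(3.0 * p)
    v_next = np.clip(v_next, -0.07, 0.07)   # velocity restricted to [-0.07, 0.07]
    p_next = p + v_next
    p_next = np.clip(p_next, -1.2, 0.5)     # position restricted to [-1.2, 0.5]
    reward = -1.0                           # reward of -1 at each time step
    done = p_next >= 0.5                    # trial terminates when the position reaches 0.5
    return p_next, v_next, reward, done
```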
We run 30 trials to learn the policy. At each trial, the initial states are drawn
randomly from −1.2 ≤ p ≤ 0.5 and −0.07 ≤ v ≤ 0.07. The system is initialized when the
first trial starts, and each trial allows a maximum of 10^4 steps. At each trial,
the number of steps is counted, and it is averaged over the 30 trials and 50 Monte Carlo
runs. For each Monte Carlo run, the same set of 30 initial values is used. In addition, for
the ϵ-greedy method, we apply an exploration rate ϵ = 0.05.
First, we apply Q-TD(λ) to find the state action map, and the performances of
different combinations of parameters (λ = 0, 0.2, 0.4, 0.6, 0.8, 1 and η = 0.1, 0.3, 0.5, 0.7, 0.9)
are observed (Figure 6-2). In general, as λ gets larger, the performance worsens.
The large mean and standard deviation appear when the car gets stuck in the valley
(p ≈ −0.5), so it fails to reach the goal within the maximum step limit of 10^4. Note that
in this task, the input state space is continuous, so there are an infinite number of
states, and using the position-velocity representation certainly does not fulfill the linear
independence criterion.
Figure 6-2. Performance of Q-TD(λ) with various combinations of λ and η.
Attempts to make Q-TD(λ) applicable in continuous input space by discretizing
the state space are usually considered. For instance, placing overlapping tiles to
partition the input space, a process called “tile coding,” is a usual approach to provide a
representation that would be expected to do a better job. Examples where we can see
the performance of TD including this preprocessing method can be found in [16, 48].
However, proper state representations are difficult to obtain because they require prior
information about the state space. It is here where we believe Q-KTD(λ) can provide an
advantage.
For Q-KTD(λ), we employ the Gaussian kernel (4–3). From the data used in the
Q-TD(λ) application, a kernel size h = 0.2 is observed to be adequate, which is close to the
heuristic that uses the distribution of squared distances between pairs of input states. To confirm the usefulness
of these values, we apply different kernel sizes (h = 0.01, 0.05, 0.1, 0.2, 0.3, 0.4), and the
mean number of steps per trial is observed. This mean is the average over 30 trials and
50 Monte Carlo runs. For this evaluation, we fix λ = 0.4 and η = 0.5.
Kernel size h = 0.05 shows the lowest mean number of steps per trial, but
performances are not significantly different over a broader range of parameter values
that includes h = 0.2, which is the largest kernel size that exhibits good performance.
Figure 6-3. The performance of Q-KTD(λ) with respect to different kernel sizes.
Again, a preference for a larger kernel size is motivated by the smoothness assumption.
The performance of Q-KTD(λ) with different combinations of λ and η is observed.
Here, the same combinations as in Figure 6-2 are tested.
Figure 6-4. Performance of Q-KTD(λ) with various combinations of λ and η.
Based on Figures 6-2, 6-3, and 6-4, the optimal parameters for Q-TD and Q-KTD can
be obtained (λ = 0.4 and η = 0.5 for Q-TD; λ = 0, η = 0.3, and h = 0.2 for Q-KTD).
With the selected parameters, we further compare the performances of Q-TD(λ) and
Q-KTD(λ). First, at each trial, we count the number of iterations until the car reaches the
goal, and then we average the number of iterations per trial over 30 trials and 50 Monte
Carlo runs. Figure 6-5 shows the relative frequency with respect to the average number
of iterations per trial. For better understanding, Figure 6-6 plots the average number
of iterations per trial with respect to the trial number. Note that the x-axis of Figure
6-5 corresponds to the y-axis of Figure 6-6. The results show that both Q-TD(λ) and
Figure 6-5. Relative frequency with respect to average number of iterations per trial ofQ-TD(λ) and Q-KTD(λ).
Figure 6-6. Average number of iterations per trial of Q-TD(λ) and Q-KTD(λ).
Q-KTD(λ) are able to find a proper policy. However, compared to Q-TD(λ), Q-KTD(λ)
works better for policy improvement: Q-KTD(λ) has more trials with a smaller number of
iterations (Figure 6-6). In addition, the large number of iterations in Figure 6-5 is due to
exploration at the initial stage of learning.
In the state value estimation problems (Boyan chain experiments in the previous
chapters), we have seen the robustness of the maximum correntropy criterion (MCC) under
different types of perturbation on the policy or environment, namely, the noisy reward
and state transition probability. Here, we will see the usefulness of correntropy for
learning under switching policies, which can be the case when implementing an
exploration/exploitation trade-off in reinforcement learning. TD algorithms integrated with
correntropy can provide better performance under such learning scenarios.
When we try to obtain a good policy without any prior knowledge of how the optimal
policy should be, the system is required to learn by exploring the environment. Thus,
at each time, the system observes certain state to action maps from experience, and
the system needs to evaluate the given policy to update the functional mapping; that is,
it is essential that the system is able to learn under changing policies. Therefore, here
we will observe how the proposed algorithms can efficiently learn a good policy while
constantly changing policies during the learning process.
We use the Mountain-car task and vary the exploration rate to confirm how the
system learns under a changing policy. We start with a totally random policy (100%
exploration rate, ϵ = 1). This exploration rate is kept until the 200th step, and then we switch
to ϵ = 0. When the exploration rate ϵ is 0, the observed performance shows exactly what
the system has been able to learn from random exploration. In addition, further steps
are allowed to let the system adjust its current estimate of the policy.
By keeping the optimal parameters of Q-KTD η = 0.3 and h = 0.2, we examine
Q-CKTD with different correntropy kernel sizes (hc = 1, 2, 3, 4, 5, 10, 50). Figure 6-7
shows the average number of steps per trial over the 30 trials and 50 Monte Carlo
runs. In the case of hc = 3, the Q-CKTD results show a mean and standard deviation
of 349.8713 ± 368.0790, whereas Q-KTD shows 558.5773 ± 1012.3 with the optimal
parameters. This observation reveals the positive effect that robustness of correntropy
as a cost function brings to learning under changing policies. For better understanding,
we further observe the average step number at each trial over 50 Monte Carlo runs
(Figure 6-8). Note that the same 30 initial states are applied for the 50 Monte Carlo runs.
Q-CKTD takes a larger number of steps at the beginning, but as learning progresses
(trial number increases), it requires significantly fewer steps per trial. We can also see
that the system adapts to the environment and is able to find a better policy. Note that
Figure 6-7. The performance of Q-CKTD with different correntropy kernel sizes.
Figure 6-8. Average number of steps per trial of Q-KTD and Q-CKTD.
until the 200th step, the policy is completely random, and thus, both Q-KTD and Q-CKTD
show an average number of steps larger than 200. The trials that reach the goal even
under a random policy are able to do so because their initial positions are close enough
to the goal.
6.2 Two Dimensional Spatial Navigation Task
We have observed the benefits of using the kernel based representations in practical
applications. Before applying Q-KTD(λ) and Q-CKTD to neural decoding in brain
machine interfaces, we present some results on a simulated, 2-dimensional spatial
navigation task. This simulation will provide insights about how the system will perform
in further practical experiments. This simulation shares some similarities with the
neural decoding experiment; based on the input states, the system predicts which
direction it should follow, and depending on the updated position, the next input states
are provided. The goal is to reach a target area where a positive reward is assigned.
No prior information of the environment is given; the system is required to explore the
environment to reach the target.
This simulation is a modified version of the maze problem in [14]. In our case there
is a 2-dimensional state space that corresponds to a square with a side-length of 20
units. The goal is to navigate from any position on the square to a target located within
the square. In our experiments, one target is located at the center of the square (10, 10),
and reaching any point within a 2 unit radius of the target is considered successful. A fixed set
of 25 points distributed in a lattice configuration are taken as initial seeds for random
initial states. Each initial state corresponds to drawing randomly one of these 25 points
with equal probability. The location of the selected point is further perturbed with unit
variance, zero mean additive Gaussian noise, G(0, 1). To navigate through the state
space, we can choose to move 3 units of length in one of the 8 possible directions that
are allowed. The maximum number of steps per trial is limited to 20.
The agent gets a reward +0.6 every time it reaches the goal, and then a new trial
starts. Otherwise, a reward −0.6 is given. Exploration rate of ϵ = 0.05 and discount
factor γ = 0.9 are used. The kernel employed is the Gaussian kernel with size h = 4.
This kernel size is selected based on the distribution of squared distance between pairs
of input states.
To assess the performance, we count the number of trials which earned the positive
reward within a group of 25 trials; that is, every 25 trials, we calculate the success rate
of the learned mapping as (# of successful trials)/25. To help in understanding the
behavior and illustrating the role of the parameters, with a fixed kernel size of 4, the
performance over various stepsizes (η = 0.01, 0.1 ∼ 0.9 with 0.1 intervals) and
values of the eligibility trace rate (λ = 0, 0.2, 0.5, 0.8, 1) is shown in Figure 6-9.
Figure 6-9. The average success rates over 125 trials and 50 implementations.
The stepsize η mainly affects the speed of learning, and within the stability limits,
larger stepsizes provide faster convergence. However, due to the effect of eligibility trace
rate λ, the stability limits suggested in [28] must be adjusted accordingly
\eta < \frac{N}{\operatorname{tr}[G_{\phi}]} = \frac{N}{\sum_{j=1}^{N} (\lambda^{0} + \cdots + \lambda^{m-1})\, \kappa(x(j), x(j))} \leq \frac{N}{\sum_{j=1}^{N} \kappa(x(j), x(j))}. (6–3)
This upper bound assumes that the maximum number of steps per trial m has been
reached, and in the case of the Gaussian kernel, the bound becomes 1/(\lambda^{0} + \cdots + \lambda^{m-1}).
Hence, for larger λ values, the stable stepsizes η lie in a smaller interval; for λ =
0, 0.2, 0.5, 0.8, 1, the stable stepsizes η lie below 1, 0.8, 0.5, 0.2, 0.05, respectively.
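For the Gaussian kernel, κ(x, x) = 1, so the bound reduces to 1/(λ^0 + · · · + λ^{m−1}). A small illustrative check with m = 20 steps per trial (the function name is hypothetical) roughly reproduces the limits quoted above:

```python
def stepsize_bound(lam, m=20):
    """Upper bound on eta from Eq. (6-3) for the Gaussian kernel (kappa(x,x) = 1):
    eta < 1 / (lam^0 + lam^1 + ... + lam^(m-1))."""
    return 1.0 / sum(lam ** k for k in range(m))

# Approximately 1, 0.8, 0.5, 0.2, 0.05 for lambda = 0, 0.2, 0.5, 0.8, 1:
print([round(stepsize_bound(lam), 3) for lam in (0, 0.2, 0.5, 0.8, 1)])
```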
The trade-off between λ and η is observed in Figure 6-9. It is clear how these
parameters can be associated with the speed of learning. At intermediate points on the
horizontal axis, the influence of λ becomes relevant as the stepsize decreases. On the
other hand, if the stepsize increases, we can see how performance degrades for larger
values of λ since the bound set by (6–3) is not satisfied.
The relation between the final filter size and the stepsize and eligibility trace rate is
also plotted in Figure 6-10. Since each trial allows a maximum of 20 steps, the
largest possible final filter size is 2500. However, with a good adaptive system, the final filter size
can be reduced. The final filter size is inversely related to the success rates (Figures
6-9 and 6-10). High success rates mean that a system has learned the state-action
mapping, whereas a system that has not adapted to the new environment keeps exploring
the space. Therefore, high success rates will correspond to small filter sizes and vice
versa.
Figure 6-10. The average final filter sizes over 125 trials and 50 implementations.
Both average success rate and final filter size show that η = 0.9 and λ = 0 have
the best performance. With the selected parameters, the success rates reach
over 95% after 100 trials. From Figure 6-11, we can observe how learning is
accomplished. At the beginning, the system explores more of the space based on
the reward information, and the trajectories look rather erratic. Once the system starts
learning, actions corresponding to states near the target point toward the reward zone,
and as time goes by this area becomes larger and larger until it covers the whole state
space.
The blue stars represent the 25 initial states, and the green arrows show the action
chosen at each state. The red dot at the center is the target, and the red circle shows the reward
zone.
Figure 6-11. Two dimensional state transitions of the first, third, and fifth sets with η = 0.9 and λ = 0.
Kernel methods are powerful for solving nonlinear problems, but the growing
computational complexity and memory size limit their applicability in practical scenarios.
To overcome this, we also show how the quantization approach presented in [9] can be
employed to ameliorate the limitations imposed by growing filter sizes. For a fixed set
of 125 inputs, we consider quantization sizes ϵU = 40, 30, 20, 10, 5, 2, 1. Figure 6-12
shows the effect of different quantization sizes on the final performance. Notice that the
minimum size for stable performance of the filter is reached at approximately 60
units. Therefore, a quantization size ϵU can be selected such that the maximum success
rate is still achieved (see Figure 6-13).
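A minimal sketch of the quantization rule, in the spirit of [9], is given below: a new input is added as a center only if it lies farther than ϵU from every existing center; otherwise the coefficient update is absorbed by the nearest center. The names and the merging convention are illustrative:

```python
import numpy as np

def quantized_update(centers, coeffs, x, delta_alpha, eps_u):
    """Sketch of the quantization rule used to limit filter growth (after [9])."""
    if centers:
        dists = [np.linalg.norm(x - c) for c in centers]
        j = int(np.argmin(dists))
        if dists[j] <= eps_u:
            coeffs[j] = coeffs[j] + delta_alpha   # absorb the update, no new unit
            return centers, coeffs
    centers.append(np.asarray(x, dtype=float))    # otherwise grow the dictionary
    coeffs.append(delta_alpha)
    return centers, coeffs
```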
Figure 6-12. The average success rates over 125 trials and 50 implementations with respect to different filter sizes.
Figure 6-13. The change of success rates (top) and final filter size (bottom) with ϵU = 5.
Let us now compare the performance of Q-KTD and Q-CKTD. Based on
Figures 6-9 and 6-10, we select λ = 0, η = 0.9, and kernel size h = 4 for both algorithms.
In the case of Q-CKTD, the correntropy kernel size hc = 10 is selected by visual
inspection. Table 6-1 shows the average success rate of Q-KTD and Q-CKTD. Note that
the average values correspond to 50 Monte Carlo runs using 125 trials per run.
Table 6-1. The average success rate of Q-KTD and Q-CKTD.
          mean      standard deviation
Q-KTD     0.7019    0.0674
Q-CKTD    0.7248    0.0455
As we can see, the Q-CKTD algorithm shows a higher average success rate as
well as a smaller variance among runs. Figure 6-14 depicts the evolution of the average
success rates along with their variance estimates across 50 Monte Carlo runs. Every
25 trials, we count the number of trials that earned positive reward within the 25 trial
interval. Both algorithms, Q-KTD and Q-CKTD, show similar performance at the very
beginning. However, as the number of trials increases, Q-CKTD displays higher average
success rates than Q-KTD. These differences are more noticeable at the 50th and
100th trials. In addition, it is also important to highlight the behavior of the standard
deviation for Q-CKTD, which decreases much faster than that of Q-KTD as the number of trials
increases. These results show that the robustness of the correntropy criterion as a cost
function can help in learning the policy.
Figure 6-14. The change of average success rates by Q-KTD and Q-CKTD.
In this chapter, we tested Q-KTD(λ) and Q-CKTD on synthetic experiments to
find a good policy based on the approximation of the action value function Q in RL.
We saw that Q-KTD(λ) provided stable performance in continuous state spaces and
found good state to action mappings. In addition, we observed that the robust nature
of Q-CKTD helped improve performance under changing policies. Experimental results
also provided insights on how to perform parameter selection. In addition, we showed
how the quantization approach could be successfully applied to control the growing filter
size. The results showed that the method was able to find good policies and could be
implemented in more realistic scenarios.
CHAPTER 7
PRACTICAL IMPLEMENTATIONS
In the previous chapters, we used the Boyan chain problem to elucidate the
properties of the different proposed algorithms when estimating state value functions.
We observed both linear and nonlinear approximation capabilities in KTD(λ). Given the appropriate
kernel size, KTD should be able to approximate both linear and nonlinear functions. In
addition, the Mountain-car and 2-dimensional spatial navigation experiments showed
the advantages of Q-KTD(λ) in continuous state spaces where the number of states
is essentially infinite. The use of kernels allows arbitrary input spaces and works with
little prior knowledge of policy. Q-KTD(λ) is a simple yet powerful algorithm to solve RL
problems. Our ultimate goal is to show that KTD(λ) can work in more realistic scenarios.
To illustrate this, we present a relevant signal processing application in brain machine
interfaces.
In our RLBMI experiments, we use a monkey's neural signal to select an action
direction (computer cursor position / robot arm position). The agent starts in a naive
state, but the subject has been trained to receive rewards from the environment. Once
it reaches the assigned target, the system and the subject earn a reward, and the agent
updates its decoder of brain activity. Through iteration, the agent learns how to correctly
translate neural states into action-direction.
7.1 Open Loop Reinforcement Learning Brain Machine Interface: Q-KTD(λ)
We first apply the neural decoder in open loop RLBMI experiments; the algorithm
learns based on the monkey’s neural states to find a proper mapping to actions while
the monkey is conducting a goal reaching task. However, the output of the agent does
not directly change the state of the environment because this is done with pre-recorded
data. The external device is updated based only on the actual monkey’s physical
response. Thus, if the monkey conducts the task properly, the external device reaches
the goal. In this sense, we only consider the monkey’s neural state from successful trials
to train the agent. The goal of this experiment is to evaluate the system’s capability to
predict the proper state to action mapping based on the monkey’s neural states and to
assess the viability of further closed loop RLBMI experiments.
7.1.1 Environment
The data employed in these experiments is provided by SUNY Downstate Medical
Center. A female bonnet macaque is trained for a center-out reaching task allowing
8 action directions. After the subject attains about 80% success rate, micro-electrode
arrays are implanted in the motor cortex (M1). Animal surgery is performed under the
Institutional Animal Care and Use Committee (IACUC) regulations and assisted by the
Division of Laboratory Animal Resources (DLAT) at SUNY Downstate Medical Center. A
set of 185 units are obtained after sorting from 96 channels, and the firing times of these
units are the ones used for the neural decoding; the neural states are represented by the
firing rates on a 100ms window.
There is a set of 8 possible targets and 8 possible action directions. Every
trial starts at the center point, and the distance from the center to each target is
4cm; anything within a radius of 1cm from the target point is considered a valid
reach (Figure 7-1).
Figure 7-1. The center-out reaching task for 8 targets.
7.1.2 Agent
In the agent, Q-learning via kernel temporal difference (Q-KTD)(λ) is applied to
neural decoding. After the neural states are preprocessed by normalizing their dynamic
range to lie between −1 and 1, they are input to the system. Based on the preprocessed
neural states, the system predicts the direction in which the computer cursor will be updated.
Each output unit represents one of the 8 possible directions, and among the 8 outputs
one action is selected by the ϵ-greedy method [56]. The performance is evaluated by
checking whether the updated position reaches the assigned target, and depending on
the updated position, a reward value will be assigned to the system.
7.1.3 Center-out Reaching Task - Single Step
First, we observe the behavior of the algorithms on a single step reaching task.
This means that rewards from the environment are received after a single step and one
action is performed by the agent per trial. The assignment of reward is based on the
1− 0 distance to the target, that is, dist(x , d) = 0 if x = d , and dist(x , d) = 1, otherwise.
Once the cursor reaches the assigned target, the agent gets a positive reward (+0.6),
otherwise it receives negative reward (−0.6) [41]. Based on the selected action with
exploration rate ϵ = 0.01, and the assigned reward value, the system is adapted as in
Q-learning via kernel TD(λ) with γ = 0.9. In our case, we can consider λ = 0 since our
experiment performs single step updates per trial.
In this experiment, the firing rates of the 185 units on 100ms windows are time
embedded using a 6th order tap delay; this creates a representation space where each
state is a vector with 1295 dimensions.
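As a sketch of this preprocessing, a 6th order tap delay stacks the current firing-rate vector with the 6 previous ones, giving 185 × 7 = 1295 dimensions; zero padding of the initial history is an assumption of this illustration:

```python
import numpy as np

def tap_delay_embed(rates, order=6):
    """Stack the current firing-rate vector with its `order` previous vectors.
    `rates` has shape (T, n_units); the output has shape (T, n_units * (order + 1)).
    With n_units = 185 and order = 6, this gives 185 * 7 = 1295 dimensions."""
    T, n_units = rates.shape
    padded = np.vstack([np.zeros((order, n_units)), rates])   # zero history before the first bin
    return np.hstack([padded[order - d : order - d + T] for d in range(order + 1)])
```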
The simplest version of the problem limits the number of targets to 2 (right and left),
and the targets should be reached within a single step. The time delayed neural net
(TDNN) has already been applied to RLBMI experiments, and its applicability in neural
decoding has been validated in [13, 31]. Thus, the performance of the Q-KTD algorithm
is compared with a TDNN as a mapper. The total number of trials is 43 for the 2 targets.
For Q-KTD, we employ the Gaussian kernel (4–3), and the kernel size h is heuristically
chosen based on the distribution of the mean squared distance between pairs of input
states; let s = E[\|x_i - x_j\|^2], then h = \sqrt{s/2}. For this particular dataset, this
heuristic gives a kernel size h = 7. The stepsize η = 0.3 is selected based on the
stability bound that was derived for KLMS [28],
\eta < \frac{N}{\operatorname{tr}[G_{\phi}]} = \frac{N}{\sum_{j=1}^{N} \kappa(x(j), x(j))} = 1. (7–1)
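The two selections just described can be summarized in a small illustrative sketch; the helper names are hypothetical, and the pairwise mean includes self-pairs for simplicity:

```python
import numpy as np

def kernel_size_heuristic(X):
    """h = sqrt(s/2), where s is the mean squared distance between pairs of input states
    (rows of X); self-pairs are included for simplicity in this rough estimate."""
    diffs = X[:, None, :] - X[None, :, :]
    s = np.mean(np.sum(diffs ** 2, axis=-1))
    return np.sqrt(s / 2.0)

def klms_stepsize_bound(N, kappa_xx=1.0):
    """Eq. (7-1): eta < N / sum_j kappa(x(j), x(j)); for the Gaussian kernel kappa(x, x) = 1,
    so the bound equals 1."""
    return N / (N * kappa_xx)
```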
The initial TD error is set to zero, and the first input vector is assigned as the first
unit’s center. After 43 trials, we count the number of trials which received a positive
reward, and the success rate is averaged over 50 Monte Carlo runs. Figure 7-2 shows
the average learning curves of Q-learning via KTD(0) and the TDNN.
Figure 7-2. The comparison of average learning curves from 50 Monte Carlo runs between Q-KTD(0) and the MLP.
KTD(0) reaches around 100% success rate after 2 epochs. In contrast, the average
success rate of TDNN slowly increases yet never reaches the same performance as
KTD. The solid line shows the mean success rates and the dashed line shows the
confidence interval based on the standard deviation. Since all the parameters are fixed
over 50 Monte Carlo runs, the confidence interval for KTD(0) can be simply associated
with the random effects introduced by the ϵ-greedy method employed for action selection
with exploration; thus, the narrow interval. However, with the TDNN a larger variation
of performance is observed, which shows how the initialization, due to local minima,
influences the success of learning; it is observed that the TDNN is able to approximate
the KTD performance, but most of the time, the system is stuck on local minima. From
this result, we can highlight one of the advantages of KTD(0) compared to MLPs, which
is the insensitivity to initialization.
However, one apparent disadvantage of using a nonparametric approach such
as KTD is the growing filter structure; the filter size increases linearly with the input
data, which is a prohibitive constraint in an online scenario. Therefore, methods for
controlling the growth of the filter are necessary; fortunately, there exist methods to
avoid this problem, such as the surprise measure [25] or the quantization approach [9],
which are incorporated in our algorithm for the 2 target center-out reaching tasks.
Without controlling the filter size, the success rates reach around 100% within 3 epochs,
but within only 20 epochs, the filter size becomes as large as 861 units. Using the
surprise measure [25], the filter size can be reduced to 87 centers with acceptable
performance. However, the quantization method [9] allows the filter size to be reduced to
10 units while keeping performance above a 90% success rate. Therefore, more
experiments applying the quantization approach are
conducted. Figure 7-3 shows the effect of filter size in the 2 target experiment.
Figure 7-3. The average success rates over 20 epochs and 50 Monte Carlo runs with respect to different filter sizes.
For filter sizes as small as 10 units, the average success rates remain stable. Thus,
filter size 10 can be chosen when efficient computation is necessary. Figure 7-4 shows
the learning curves corresponding to different filter sizes in comparison with TDNN.
The average success rates are computed over 50 Monte Carlo runs.
Figure 7-4. The comparison of KTD(0) with different final filter sizes and TDNN with 10 hidden units.
As we pointed out, in the case of total filter size of 10 (red line), the algorithm shows
almost the same learning speed as the linearly growing filter size, with success rates
above 90%. When we compare the average learning curves to TDNN, even a filter with 3
units (magenta line) using KTD(0) performs better than TDNN.
In the 2 target single step center out reaching task, Q-KTD(0) showed promising
results solving the initialization and growing filter size issues. Further analysis of
Q-KTD(0) is conducted on a more difficult task involving a larger number of targets.
All the experimental values are kept fixed using the same setup as in the above
experiments. The only changes are the number of targets, from 2 to 8 (1 ∼ 8), and the
stepsize, η = 0.5.
Since the total number of trials is 178 in this experiment, without any mechanism to
control the filter size, the filter structure can grow up to 1780 units within 10 epochs. The
quantization approach [9] is again applied to reduce the filter size. Intuitively, there is an
intrinsic relation between quantization size ϵU and kernel size h. Consequently, based on
the distribution of squared distance between pairs of input states, various kernel sizes
(h = 0.5, 1, 1.5, 2, 3, 5, 7) and quantization sizes (ϵU = 1, 110, 120, 130) are tested. The
corresponding success rates for final filter sizes of 178, 133, 87, and 32 are displayed in
Figure 7-5.
Figure 7-5. The effect of filter size control on the 8-target single-step center-out reaching task. The average success rates are computed over 50 Monte Carlo runs after the 10th epoch.
Again, since all the parameters are fixed over the 50 Monte Carlo runs, the narrow
error bars are due to the random action selection for exploration, and this small variation
supports the claim that this kernel approach does not depend heavily on initialization, unlike
conventional TD learning algorithms based on neural networks. With a final filter size of
178 (blue line), the success rates are superior to those of any other filter size for every kernel
size tested, since it contains all the input information. Especially for small kernel
sizes (h ≤ 2), success rates above 96% are observed. Moreover, note that even
after reduction of the state information (red line), the system still produces acceptable
success rates for kernel sizes ranging from 0.5 to 2 (around 90% success rates).
Intuitively, the largest kernel sizes that provide good performance are better for
generalization; in this sense, a kernel size h = 2 is selected since this is the largest
kernel size that considerably reduces the filter size and yields a neural state to action
mapping that performs well (around a 90% success rate). In the case of kernel size
h = 2 with final filter size of 178, the system reaches 100% success rates after 6 epochs
with a maximum variance of 4% (Figure 7-6). To observe the learning process, success
rates are calculated after each epoch (1 epoch contains 178 trials).
Figure 7-6. The average success rates for various filter sizes.
The 8-target experiment shows the effect of the filter size, and how it converges
after 6 epochs (Figure 7-6). As we can see from the number of units in both cases,
higher representation capacity is required to obtain the desired performance as the
task becomes more complex (Figures 7-4 and 7-6). The results of the algorithm on the
8-target center-out reaching task show that the method can effectively learn the
brain state to action mapping for this task and remains feasible.
7.1.4 Center-out Reaching Task - Multi-Step
Here, we want to develop a more realistic scenario. Therefore, we extend the task
to multi-step and multi-target experiments. This case allows us to explore the role of
the eligibility traces in Q-KTD(λ). The price paid for this extension is that now the
selection of λ, with 0 < λ < 1, needs to be carried out according to the best observed performance.
Testing based on the same experimental setup as with the single step task, that is, a
discrete reward value is assigned at the target, causes extremely slow learning since
no guidance is given. The system requires long periods of exploration until it actually
reaches the target. Therefore, we employ a continuous reward distribution around the
selected target defined by the following expression:
r(s) = \begin{cases} p_{reward}\, G(s) & \text{if } G(s) > 0.1, \\ n_{reward} & \text{if } G(s) \leq 0.1, \end{cases} \qquad \text{where } G(s) = \exp\left[ -(s - \mu)^{\top} C_{\theta}^{-1} (s - \mu) \right] (7–2)

where s ∈ R^2 is the position of the cursor, p_reward = 1, and n_reward = −0.6. The mean
vector µ corresponds to the selected target location, and the covariance matrix

C_{\theta} = R_{\theta} \begin{pmatrix} 7.5 & 0 \\ 0 & 0.1 \end{pmatrix} R_{\theta}^{\top}, \quad \text{where } R_{\theta} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix},

depends on the angle θ of the selected target as follows: for targets one and five the
angle is 0, two and six −π/4, three and seven π/2, and four and eight π/4. Figure 7-7
shows the reward distribution for target one.
Figure 7-7. Reward distribution for right target.
The same form of distribution is applied to the other directions centered at the
assigned target point. The black diamond is the initial position, and the purple diamond
shows the possible directions including the assigned target direction (red diamond).
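A small illustrative computation of this reward model, following the reconstructed expression above (a Gaussian-shaped bump that decays away from the target), could look as follows; the function name and arguments are hypothetical:

```python
import numpy as np

def directional_reward(s, target, theta, p_reward=1.0, n_reward=-0.6):
    """Sketch of the reward model (7-2): a Gaussian-shaped reward elongated along the
    target direction theta; positions with G(s) <= 0.1 receive the negative reward."""
    R = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    C = R @ np.diag([7.5, 0.1]) @ R.T                      # covariance C_theta
    d = np.asarray(s, dtype=float) - np.asarray(target, dtype=float)
    G = np.exp(-d @ np.linalg.solve(C, d))                 # Gaussian bump, G = 1 at the target
    return p_reward * G if G > 0.1 else n_reward
```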
Once the system reaches the assigned target, the system earns a maximum reward
of +1, and receives partial rewards according to (7–2) during the approaching stage.
When the system earns the maximum reward, the trial is classified as a successful trial.
The maximum number of steps per trial is limited such that the cursor must approach
the target on a straight line trajectory. Here, we also control the complexity of the task
by allowing different number of targets and steps. Namely, 2-step 4-target (right, up,
left, and down); and 4-step 3-target (right, up, and down) experiments are performed.
Increasing the number of steps per trial amounts to making smaller jumps according
to each action. After each epoch, the number of successful trials are counted for each
target direction. Figure 7-8 shows the learning curves for each target and the average
success rates.
A 2-step 4-target B 4-step 3-target
Figure 7-8. The learning curves for multi step multi target tasks.
A larger number of steps results in lower success rates. However, both cases
(two and four steps) obtain an average success rate above 60% in the first epoch. This result
suggests that the algorithms could be applied in online scenarios. The performance
shows that all directions can achieve success rates above 70% after convergence.
7.2 Open Loop Reinforcement Learning Brain Machine Interface: Q-CKTD
We have already seen the performance of Q-KTD(λ) in finding an optimal neural to
motor mapping. In this section, we want to compare the performance of Q-learning
via KTD(λ) and CKTD. Both algorithms are applied to passive data on a center-out
reaching task aiming at 4 targets (right, up, left, and down). The difference between
the passive data and the data employed in the previous section is that for the passive
data the monkey does not perform any movement; it only observes how the position
of a cursor changes over time. Neural states are recorded while the monkey watches
the screen changing through the duration of the experiment. Spike times from 49 units
are converted to firing rates using a 100ms window, and a 9th order tap delay line is
applied to the input; hence, 490 dimensions are used to represent the neural states. The
total number of trials is 144; each trial is initialized at the center and allows 2 steps
to approach the target. The distance between the initial point (center) and the target can
be covered in 1 step. A trial is terminated once it exceeds 2 steps or receives the positive
reward +1.5. Here, the positive reward value is assigned when the cursor reaches the
reward zone (0.2 distance from an assigned target). Otherwise, it earns negative reward
−0.6. The discount factor is γ = 0.9, the exploration rate ϵ = 0.01, and stepsize η = 0.5.
In these experiments, we do not apply any filter size control. The kernel size for
KTD is chosen based on the distribution of squared distances between pairs of input
states, resulting in h = 0.8. When we fix the filter kernel size to h = 0.8 and apply
Q-CKTD, there is no significant difference between Q-KTD and Q-CKTD. However, by
changing the filter kernel size to 1, Q-CKTD shows improvement over KTD. Here, the
correntropy kernel size is hc = 1.
The success rates at each epoch are obtained as the average number of successful
trials over the 4 targets. These success rates are further estimated by averaging over 50
Monte Carlo runs; the results are displayed in Figure 7-9.
The average success rates between Q-KTD with filter kernel size h = 0.8 and
Q-CKTD with filter kernel size h = 1 and correntropy kernel size hc = 1 are compared
(Figure 7-9 (a)). CKTD shows improved success rates for the 1st and 2nd epochs.
However, the success rates after the 3rd epoch remain essentially equal to those for
Kernel TD. Since correntropy KTD weights are a combination of the error values e with
the level of importance based on the κ(e), and the error distribution changes during
A Correntropy KTD with fixed correntropy kernel size 1.
B Correntropy KTD with reduced kernel size from 1 to 0.8 at the 3rd epoch.
Figure 7-9. Average success rates over 50 runs. The solid line shows the mean success rates, and the dashed line shows the standard deviation.
the learning process, it is reasonable to assume that the size of the correntropy kernel
may need to be adjusted as learning progresses. A principled method to select the
correntropy kernel size is still under development; however, we chose to manually set
changes in the correntropy kernel size by observing the evolution of the errors. The
correntropy kernel size hc is reduced from 1 to 0.8 at the 3rd epoch. As we predicted,
an improvement in success rates is observed at the 3rd and 4th epochs (Figure 7-9 (b)).
This motivates further work on fine tuning the correntropy kernel size, and some future
effort will be devoted to this issue.
To understand better the properties of Q-KTD and Q-CKTD, we observe the
behavior of other quantities such as the actual predictions of the Q-values and individual
success rates according to each target. These quantities are observed by employing the
same parameter set as in Figure 7-9 (b), but the results are obtained from a single run.
First, the Q-value changes are observed at each trial (Figure 7-10). Correntropy
KTD has slower convergence in Q-values than Kernel TD. However, correntropy KTD
shows higher success rates over time. In addition, when we check the Q-value changes
at the 1st epoch (Figure 7-11 (a) and (b)), correntropy KTD has higher values, and it
attempts to explore more directions during learning. Since the positive reward is 1.5 and
the Q-value represents the expected reward given a state and action, it is desirable for the
value predictor to converge to 1.5.
A Kernel TD.
B Correntropy KTD.
Figure 7-10. Q-value changes per trial during 10 epochs.
Although the Q-values predicted by Kernel TD are
closer to the positive reward 1.5 (Figure 7-11 (c) and (d)), the variance of the Q-value
does not affect the success rates. This leaves an open question about what properties
of correntropy may be involved in this behavior, and it becomes an important reason to
carry out further analysis in order to fully understand the algorithm.
The success rate of each target is observed from the 1st to the 5th epoch (Figure 7-12).
Target indices 1, 3, 5, and 7 represent right, up, left, and down, respectively. When we
apply Kernel TD, at the beginning, learning tends to focus on certain
directions; during the first epoch the agent mainly learns the down direction (target
index 7), and during the second epoch the learning inclines towards the left direction
(target index 5) (Figure 7-12(a)). However, the learning variation over each direction in
correntropy KTD is smaller in comparison with Kernel TD (Figure 7-12 (b)).
7.3 Closed Loop Brain Machine Interface Reinforcement Learning
Q-KTD(λ) has been tested on open loop RLBMI experiments, and we have seen
that the algorithm performs well in that setting. Therefore, the
application has progressed to closed loop RLBMI experiments. In closed loop RLBMI
experiments, the agent is trained to find a mapping from the monkey’s neural states
A At 1st epoch by KTD.
B At 1st epoch by Correntropy KTD.
C At 10th epoch by KTD.
1300 1320 1340 1360 1380 1400 1420 14400
1
2
3
4
5
6
7
8
Targ
et
Index
Trial Numbers
1300 1320 1340 1360 1380 1400 1420 1440−0.5
0
0.5
1
1.5
2
Q−
va
lues
Trial Numbers
Target1
Target2
Target3
Target4
Target5
Target6
Target7
Target8
D At 10th epoch by Correntropy KTD.
Figure 7-11. Target index and matching Q-values.
to a robot arm position. The monkey has been trained to associate its neural states
with a particular task goal. The behavior task is a reaching task using a robotic arm, in
which the decoder controls the robot arm’s action direction by predicting the monkey’s
intent based on its neuronal activity. If the robot arm reaches the assigned target, a reward is given to both the monkey (a food reward) and the decoder (a positive value).
Notice that the two intelligent systems learn co-adaptively to accomplish the goal. These
experiments are conducted in cooperation with the Neuroprosthetics Research Group
at the University of Miami. The performance is evaluated in terms of task completion
accuracy and speed. Furthermore, we attempt to evaluate the individual performance of
each one of the systems in the RLBMI.
7.3.1 Environment
During pre-training, a marmoset monkey has been trained to perform a target
reaching task aimed at two spatial locations (A or B trial); the monkey was taught to
associate changes in motor activity during A trials, and produce static motor responses
[Figure: success rate per target index (1, 3, 5, 7) for the 1st through 5th epochs: (A) KTD, (B) Correntropy KTD.]
Figure 7-12. The success rates of each target over 1 through 5 epochs.
during B trials. When one target is assigned, the trial starts with a beep. To conduct the trial during the user training phase, the monkey is required to hold its hand steadily on a touch pad for 700–1200 ms. This action produces a go beep, followed by one of the two target LEDs being lit (A trial: red light for the left direction; B trial: green light for the right direction). The robot arm moves up to the home position, namely the center position between the two targets. Its gripper presents an object (a food reward such as a waxworm or marshmallow for an A trial, or an undesirable object, a wooden bead, for a B trial). For the A trial, the monkey should move its arm to a sensor within 2000 ms, and for the B trial, the monkey should hold its arm on the initial sensor for 2500 ms. If the monkey successfully performs the task, the robot arm moves in the assigned direction, the target LED blinks, and the monkey receives the food reward.
After the monkey is trained to perform the assigned task properly, a micro-electrode array (a 16-channel tungsten microelectrode array, Tucker Davis Technologies, FL) is
surgically implanted under isoflurane anesthesia and sterile conditions. In the closed
loop RLBMI, neural states from the motor cortex (M1) are recorded. These neural states
become inputs to the neural decoder. All surgical and animal care procedures were
consistent with the National Research Council Guide for the Care and Use of Laboratory
Animals and were approved by the University of Miami Institutional Animal Care and Use
Committee. In the closed loop experiment, after the initial holding time that produces
the go beep, the robotic arm’s position is updated based solely on the monkey’s neural
states, and the monkey is not required to perform any movement unlike during the user
pre-training sessions.
During the real-time experiment, 14 neurons are recorded from 10 electrodes. The neural states are represented by the firing rates computed over a 2-second window following the go signal.
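As a rough illustration of how such a state vector could be assembled, the sketch below counts spikes per sorted neuron within the 2-second window after the go signal; the 2-second window and the 14 recorded neurons come from the text, while the data layout, function name, and timestamp units are assumptions.

```python
import numpy as np

def neural_state(spike_times_per_neuron, go_time, window=2.0):
    """Build a firing-rate state vector from spike timestamps (seconds).

    spike_times_per_neuron: list of 1-D arrays, one per sorted neuron (14 in this experiment).
    go_time: time of the go signal for the current trial.
    window: length of the counting window following the go signal (2 s in the text).
    """
    rates = []
    for spikes in spike_times_per_neuron:
        count = np.sum((spikes >= go_time) & (spikes < go_time + window))
        rates.append(count / window)   # spikes per second
    return np.asarray(rates)           # one firing rate per neuron
```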
7.3.2 Agent
For the BMI decoder, we use Q-learning implemented with kernel temporal differences, Q-KTD(λ). The advantage of KTD for online applications is that it does not depend on the initialization, nor does it require any prior information about the input states. Also, this algorithm brings the advantages of both TD learning [50] and kernel methods [44]. Therefore the algorithm is expected to properly predict the neural state to action map, even though the neural states vary in each experiment. Based on the monkey's neural state, the BMI decoder produces an output using the Q-KTD algorithm. The output represents the 2 possible directions (left and right), and the robot arm moves accordingly.
One big difference between open and closed loop applications is the amount of accessible data; in the closed loop experiment, we can only obtain information about the neural states up to the current time. In the previous offline experiment, normalization and kernel selection were conducted off line based on the entire data set. However, it is not possible to apply the same method in the online setting, since we only have information about the input states up to the present time. Normalization is a scaling factor that interacts with the kernel size; proper selection of the kernel size provides proper scaling of the data. The dynamic range of states can change from experiment to
experiment. Consequently, in an online application, the kernel size needs to be adjusted
at each time. Before getting any neural states, the kernel size cannot be determined.
Thus, in contrast to the previous open loop experiments, normalization of the input
neural states is not applied, and the kernel size is automatically selected from the given
inputs.
For Q-KTD(λ), the Gaussian kernel (4–3) is employed. The kernel size h is automatically selected based on the history of inputs. Note that in the closed loop experiments, the dynamic range of states varies from experiment to experiment. Consequently, the kernel size needs to be readjusted each time a new experiment takes place and cannot be determined beforehand. At each time step, the distances between the states are computed to calculate the output values. Therefore, we use these distance values to select the kernel size as follows:
\[
h_{\text{temp}}(n) = \sqrt{\frac{1}{2(n-1)} \sum_{i=1}^{n-1} \left\| x(i) - x(n) \right\|^2} \qquad (7–3)
\]
\[
h(n) = \frac{1}{n} \left[ \sum_{i=1}^{n-1} h(i) + h_{\text{temp}}(n) \right] \qquad (7–4)
\]
Using the squared distances between pairs of previously seen input states, we can
obtain an estimate of the mean distance, and this value is also averaged along with past
kernel sizes to assign the current kernel size.
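A minimal sketch of this selection rule, reading (7–3) and (7–4) literally, is given below; the default value used before any pairwise distances are available is an assumption, since the text does not specify how the very first kernel size is chosen.

```python
import numpy as np

def update_kernel_size(state_history, h_history, x_new, h_default=1.0):
    """Online kernel-size selection following (7-3) and (7-4).

    state_history: list of previously seen state vectors x(1), ..., x(n-1)
    h_history:     list of previously assigned kernel sizes h(1), ..., h(n-1)
    x_new:         current input state x(n)
    """
    n = len(state_history) + 1
    if n == 1:
        h_new = h_default  # no distances available yet (assumed default)
    else:
        sq_dists = [np.sum((np.asarray(x) - np.asarray(x_new)) ** 2) for x in state_history]
        h_temp = np.sqrt(np.sum(sq_dists) / (2.0 * (n - 1)))        # (7-3)
        h_new = (np.sum(h_history) + h_temp) / n                    # (7-4)
    state_history.append(np.asarray(x_new))
    h_history.append(h_new)
    return h_new
```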
The initial error is set to zero, and the first input state vector is assigned as the
first unit’s center. Normalization of the input neural states is not applied, and a stepsize
η = 0.5 is used. Moreover, we consider γ = 1 and λ = 0 since our experiment performs
single step trials in (5–8).
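To summarize how these pieces interact, the following is a minimal sketch of one closed-loop Q-KTD(0) trial under the settings above (two output units, greedy action selection, reward of ±1, η = 0.5, γ = 1, λ = 0). It assumes a kernel-adaptive-filter style update in which each trial adds one unit centered at the current neural state; this is our reading of the update rule, not a verbatim transcription of (5–9), and the reward_fn callback standing in for the experiment outcome is purely illustrative.

```python
import numpy as np

def gaussian_kernel(x, c, h):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(c)) ** 2) / (2.0 * h ** 2))

def predict_q(centers, coeffs, x, h, n_actions=2):
    """Q-value of every action at state x as a kernel expansion over the stored centers."""
    q = np.zeros(n_actions)
    for c, coef in zip(centers, coeffs):      # coef holds one coefficient per action
        q += gaussian_kernel(x, c, h) * coef
    return q

def qktd_trial(centers, coeffs, x, h, reward_fn, eta=0.5, n_actions=2):
    """One closed-loop Q-KTD(0) trial on the single-step, two-target task (sketch)."""
    q = predict_q(centers, coeffs, x, h, n_actions)
    action = int(np.argmax(q))                # greedy action: 0 = left, 1 = right
    reward = reward_fn(action)                # +1 if the robot arm reaches the target, else -1

    # Single-step trial with gamma = 1 and lambda = 0: the TD error reduces to reward - Q(x, a).
    td_error = reward - q[action]

    # Grow the expansion: a new unit centered at x contributes eta * td_error
    # to the output of the selected action only.
    new_coef = np.zeros(n_actions)
    new_coef[action] = eta * td_error
    centers.append(np.asarray(x))
    coeffs.append(new_coef)
    return action, reward, td_error
```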
7.3.3 Results
The overall performance is evaluated by checking whether the robotic arm reaches
the assigned target or not. Once the robot arm reaches the target, the decoder gets a
positive reward +1, otherwise, it receives negative reward −1.
Figure 7-13 shows the decoder performance for 2 experiments; the first experiment
(left column) has a total of 20 trials (10 A trials and 10 B trials). The overall success rate
was 90%. Only the first trial for each target was mis-assigned. The second experiment
(right column) has a total of 53 trials (27 A trials and 26 B trials), with overall success
rate of 41/53 (around 77%). Although the success rate of the second experiment is
Figure 7-13. Performance of Q-learning via KTD in the closed loop RLBMI controlled by a monkey for experiment 1 (left) and experiment 2 (right); the success (+1) and failure (−1) index of each trial (top), the change of TD error (middle), and the change of Q-values (bottom).
not as high as the first experiment, both experiments show that the algorithm learns an
appropriate neural state to action map. Even though there is variation among the neural
states within each experiment, the decoder adapts well to minimize the TD error, and
the Q-values converge to the desired values for each action; since this is a single step
task and the reward +1 is assigned for a successful trial, it is desired that the estimated
Q-value Q̃ be close to +1.
It is observed that the TD error and Q-values oscillate. Drastic changes in the TD error or Q-values correspond to missed trials. The overall performance can be
evaluated by checking whether the robot arm reaches the desired target or not (the top
plots in Figure 7-13). However, this assessment does not show what causes the change
in the system values. In addition, it is hard to know how the two separate intelligent
systems interact during learning and how neural states affect the overall performance.
7.3.4 Closed Loop Performance Analysis
Since this RLBMI architecture contains 2 separate intelligent systems that co-adapt,
it is important to have not only a well performing BMI decoder but also a well trained BMI
user. Under the co-adaptation scenario, it is obvious that if one system does not perform
properly, it will cause detrimental effects on the performance of the other system. If the
BMI decoder does not give proper updates to the robotic device, it will confuse the user
conducting the task, and if the user gives improper state information or the translation is
wrong, the resulting update may fail even though the BMI decoder was able to find the
optimal mapping function.
Here, we analyze how each participant (agent and user) influences the overall
performance both in successful and missed trials by visualizing the states, corresponding
action values Q, and resulting policy in a two-dimensional space. This is the first attempt
to evaluate the individual performance of the subject and the computer agent on
a closed loop Reinforcement Learning Brain Machine Interface (RLBMI). With the
proposed methodology, we can observe how the decoder effectively learns a good state
to action mapping, and how neural states affect the prediction performance. A major
assumption in our methodology is that the user always implements the same strategy to solve the task; otherwise this analysis breaks down. Under this assumption, when the system encounters an unexpected state we attribute it to the user being distracted or uncooperative. This may not always be the case, but we did not have access to enough additional information to quantify behavior beyond visual inspection.
In the two-target reaching task, the decoder contains two output units representing
the functions Q(x, a = left) and Q(x, a = right). The policy is determined by selecting the action whose unit has the larger Q-value. The
performance of the decoder is commonly evaluated in terms of success rate by counting
the successful trials that reach the desired targets, along with the changes in the TD
error or the Q-values. However, these criteria are not well suited to understand how
the two intelligent systems interact during learning. For instance, if there is a change in
performance or an error in the decoding process it is hard to tell which one of the two
subsystems is more likely to be responsible for it.
Another difficulty in evaluating the user's output is that the neural states are high dimensional vectors. Therefore, we want to apply a dimensionality reduction technique that produces a representation of the user's output that can be visualized and easily interpreted, while remaining independent of the class labels (unsupervised). We found that
principal component analysis (PCA) on the set of observed neural states is sufficient
for the goal of this analysis. PCA is a well known method to transform data to a new
coordinate system based on the eigenvalue decomposition of the data covariance matrix. Let X = [x(1), x(2), · · · , x(n)]⊤ be the data matrix containing the set of observed states during the closed loop experiment up to time n. A transformed data set Y = XW can be obtained using the transformation matrix W, which corresponds to the matrix of eigenvectors of the covariance matrix n⁻¹X⊤X. Without loss of generality, we assume that the data X has zero mean. The distribution of states up to time n can be visualized by projecting the high dimensional neural states into two dimensions using the two largest principal components.
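A minimal sketch of this projection, assuming the observed neural states are stacked as the rows of X:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project observed neural states onto their leading principal components.

    X: (n, d) matrix of the neural states observed up to time n.
    Returns the projected states Y, the transformation matrix W, and the removed mean.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                                  # zero-mean data, as assumed in the text
    cov = (Xc.T @ Xc) / Xc.shape[0]                # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigendecomposition (ascending eigenvalues)
    W = eigvecs[:, ::-1][:, :n_components]         # two largest principal components
    Y = Xc @ W                                     # projected states
    return Y, W, mean
```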
In this two-dimensional space of projected neural states, we can also show the
relation with the decoder by computing the outputs of the units associated with each
one of the actions and displaying them as contour plots. A set of two-dimensional space
locations Ygrid evenly distributed on the plane can be projected back into the high dimensional space of neural states as X̂ = Ygrid W⊤. Let Q_i^(n) be the i-th output unit of the decoder updated using (5–9) at time n. We can compute the estimated Q-values at a point y on the two-dimensional plane as Q̂^(n)(x̂ = Wy, a = i). In this way, we can extrapolate the
possible outputs that the decoder would produce in the vicinity of the already observed
data points. Furthermore, the final estimated policy can be obtained by selecting the unit
that matches the action with the maximum Q-value among all output units (Figure 7-14).
Figure 7-14. Proposed visualization method.
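Continuing the previous sketch, the grid evaluation and the resulting policy partition could be computed as follows; the q_value callback standing in for the learned decoder Q̂^(n), as well as the grid limits, are assumptions for illustration.

```python
import numpy as np

def policy_map(W, mean, q_value, y1_range=(-50, 50), y2_range=(-120, 20), n_grid=60, n_actions=2):
    """Evaluate the decoder on a grid of points in the projected two-dimensional space.

    W:       (d, 2) matrix of the two leading principal components (from pca_project).
    mean:    mean neural state removed before projection.
    q_value: callable q_value(x, a) returning the decoder's estimate of Q(x, a) (assumed interface).
    Returns the Q-value surface of every action and the greedy policy over the grid.
    """
    y1 = np.linspace(*y1_range, n_grid)
    y2 = np.linspace(*y2_range, n_grid)
    Q = np.zeros((n_actions, n_grid, n_grid))
    for i, a1 in enumerate(y1):
        for j, a2 in enumerate(y2):
            x_hat = mean + W @ np.array([a1, a2])   # back-projection of the grid point
            for a in range(n_actions):
                Q[a, j, i] = q_value(x_hat, a)
    policy = np.argmax(Q, axis=0)                    # greedy action at every grid location
    return Q, policy
```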
Here, we visualize the neural states and corresponding Q-values and policy π
related to the final performance. Thus, the final learned decoder Q̂(T ) and all the neural
states X are utilized; that is, n = T and X is of size T × d where d is the dimension
of the neural state vectors. Notice that the proposed method can also be applied at
any stage of the learning process; we can observe the behavior of the two systems at any intermediate time by using the subset of neural states observed up to that time together with the decoder learned up to that time.
Figure 7-15 provides a visualization of the distribution of the 14 dimensional
neural states projected into two dimensions. The corresponding contour levels are the
estimated action values Q̃ using the learned decoder from the closed loop experiment.
In addition, we provide the partition for left and right actions in the projected two
dimensional space, which corresponds to the final policy derived from the estimated
Q-values. The projection shows that the neural states from the two classes are
separable. As we expected, the Q-values for each direction have higher values on
regions occupied by the corresponding neural states. For example, the Q-values for the
[Figure: contour plots of the estimated Q-values and the resulting policy partition in the space of the first two principal components, with A-trial and B-trial neural states marked, for experiments 1 and 2.]
Figure 7-15. The estimated Q-values (top) and resulting policy (bottom) for the projected neural states using PCA from experiment 1 (left column) and experiment 2 (right column). The first and third top plots show the Q-values for the “right” direction, and the second and fourth top plots show the Q-values for the “left” direction.
right direction have larger values for the areas filled by the states corresponding to B
trial. This is confirmed by showing the partitions achieved by the resulting policy.
During the training session, the success rates were highly dependent on the
monkey’s performance. Most of the times when the agent predicted the wrong target,
it was observed that the monkey was distracted, or it was not interacting with the task
properly. We are also able to see this phenomenon from the plots; the failed trials during
the closed loop experiment are marked as red stars (missed A trials) and green dots
(missed B trials). We can see that most of the neural states that were misclassified
appear to be closer to the states corresponding to the opposite target in the projected
state space. This supports the idea that failure during these trials was mainly due to the
monkey’s behavior and not to the decoder.
From the bottom plots, it is apparent that the decoder can predict nonlinear policies.
Finally, the estimated policy in experiment 2 (bottom right plot) shows that the system
effectively learns and goes from an initially misclassified A trial (during the closed loop
experiment), which is located near the border and right bottom areas, to a final decoder
where the same state would be assigned to the correct direction. It is remarkable that the system adapts to the environment on-line.
We have applied the Q-KTD(λ) and Q-CKTD algorithms to neural decoding for brain
machine interfaces. In the open loop RLBMI experiment, we confirmed that the system
was able to find a proper neural state to action mapping. In addition, we saw how
by using correntropy as a cost function there could be potential improvements to the
learning speed. Q-KTD(λ) was then successfully applied to closed loop experiments, where the decoder was able to provide the proper robot arm actions. Finally, we explored a first attempt at analyzing the behavior of the two intelligent systems separately. With
the proposed methodology, we observed how the neural state influences the decoder
performance.
CHAPTER 8
CONCLUSIONS AND FUTURE WORK
The reinforcement learning brain machine interface (RLBMI) [13] has been shown
to be a promising paradigm for BMI implementations. It allows co-adaptive learning
between two intelligent systems; one is the BMI decoder on the agent side, and the
other is the BMI user as part of the environment. From the agent side, the proper neural
decoding of the motor signals is essential to control the external device that interacts
with the physical environment. However, there are several challenges that must be
addressed in order to turn RLBMI into a practical reality. First, algorithms must be able
to readily handle the high dimensional state spaces that correspond to the neural state representation. The mapping from neural states to actions must be flexible enough to handle nonlinear relations while making few assumptions. Algorithms should require a reasonable amount of computational resources to allow real time implementation. The algorithms should also handle cases where assumptions may not hold, e.g., the presence of outliers or perturbations in the environment. In this thesis, we have introduced
algorithms that take into account the above mentioned issues. We have employed
synthetic experiments that illustrate the properties of the proposed methods and
encourage their applicability in practical scenarios. Finally, we applied these algorithms
to RLBMI experiments, showing their potential advantages in a relevant application.
We started by introducing three new temporal difference (TD) algorithms for state
value function estimation. State value function estimation is an intermediate step to
find a proper mapping from state to action, from which all fundamental features of the
algorithms could be observed. This functional approximation is able to handle large
amounts of input data which is often required in practical implementations. We have
seen how the proposed TD algorithms can provide functional approximation of state
value functions given a policy.
Kernel temporal difference (KTD)(λ) was proposed by integrating kernel-based representations into conventional TD learning. The big advantages of this kernel-based
learning algorithm are the nonlinear functional approximation capabilities along with
the known convergence guarantees of linear TD learning, which results in more
accurate and faster learning. Using the dual representations, it can be shown that
the convergence results for linear TD(λ) extend to the kernel-based algorithm. By using
strictly positive definite kernels, the linear independence condition is automatically
satisfied for input state representations in absorbing Markov processes. Experiments on
simulated data drawn from absorbing Markov chains allowed us to confirm the method’s
nonlinear approximation capabilities.
Moreover, robust variants of TD(λ) and KTD(λ) algorithms were proposed by using
correntropy as a cost function. Namely, correntropy temporal difference (CTD) and
correntropy kernel temporal difference (CKTD) were derived for the case of λ = 0.
TD(λ) and KTD(λ) use mean square error (MSE) as their objective function which has
known limitations in environments corrupted by non-Gaussian noise. Experiments using a synthetic absorbing Markov chain showed that CTD and CKTD are able to provide better robustness than their MSE-based counterparts under non-Gaussian noise or perturbed state transitions.
We have observed that KTD(λ) has better performance with λ = 0 than larger λ due
to the relation between stepsize η and the eligibility trace rate λ. In multistep prediction
problems, when the Gaussian kernel is employed in the system, larger eligibility trace
rates require smaller stepsizes for stable performance, which also depends on the
allowed number of steps per trial. Small stepsizes for large λ make the performance
slower compared to the larger learning rates that small λ values allow. Thus, it is intuitive that CTD and CKTD with larger λ may not perform as well as with λ = 0 in on-line implementations. However, it is necessary to further explore the behavior
of CTD and CKTD with general λ. The extension of TD(0) and KTD(0) to general λ
using the multi-step prediction as a starting point does not seem to be applicable for
correntropy, since there is no obvious way to interchange terms in the cost due to the
nonlinearity of the kernel employed by correntropy; no updates can be made before
a trial is complete. The update rule we derived for general λ requires us to update
the system once a trial is complete. Therefore, further study for the derivation of CTD
and CKTD for general λ is required. In addition, we observed that CTD and CKTD
have stable performance. However, further analysis is still required to determine the
convergence points.
We extended all proposed algorithms to state-action value functions based on
Q-learning. This extension allows us to find a proper state to action mapping which
can be further exploited in practical cases such as the neural decoding problem in
reinforcement learning brain machine interfaces. The introduced TD algorithms were
extended to estimate action-value functions, and based on the estimated values, the
optimal policy can be decided using Q-learning. Three variants of Q-learning were
derived: Q-learning via correntropy temporal difference (Q-CTD), Q-KTD(λ), and
Q-CKTD.
The observation and analysis of CTD, KTD(λ), and CKTD gives us a basic idea
of how the proposed extended algorithms behave. However, in the case of Q-CTD,
Q-KTD(λ), and Q-CKTD, the convergence analysis is still challenging since Q-learning
contains both a learning policy and a greedy policy. In the case of Q-KTD(λ), the
convergence proof for Q-learning using temporal difference (TD)(λ) with linear function
approximation in [32] gives a basic intuition for the role of function approximation on
the convergence of Q-learning. For the kernel-based representation in Q-KTD(λ), the
direct extension of the results from [32] would bring the advantages of nonlinear function
approximation. Nonetheless, applying these results requires an extended version of the ordinary differential equation (ODE) method for Hilbert space valued differential equations.
The extended algorithms were applied to find an optimal control policy in decision
making problems where the state space is continuous. We observed the behavior of
Q-KTD and Q-CKTD under various parameter sets including kernel size, stepsize, and
eligibility trace rate. From the experiments, we observed that the optimal filter kernel size
depends on the input distribution and affects the learning speed, and proper annealing
of the stepsize is required for convergence. For KTD small eligibility traces tend to work
better. In the case of correntropy, the kernel size presents a trade off between learning
speed and robustness and also depends on the error distribution. Results showed that
Q-KTD(λ) can offer performance advantages over other conventional nonlinear function
approximation methods. Furthermore, it is important to highlight how the robustness
property of the correntropy criterion can be exploited to improve learning under changing
policies. We have empirically observed that Q-CKTD was able to provide a better policy
in the off-policy learning paradigm.
Furthermore, Q-KTD(λ) was applied to estimate an optimal policy in open loop
brain machine interface (BMI) problems, and experimental results show the method can
effectively learn the brain-state action mapping. We also tested Q-CKTD on an open
loop RLBMI application to assess the algorithm’s capability in estimating a proper state
to action map. In off-policy TD learning, Q-CKTD results showed that the optimal policy
could be estimated even without having perfect predictions of the value function in a
problem involving a discrete set of actions.
Finally, we applied Q-KTD to closed loop RLBMI experiments using a monkey.
Results showed that the algorithm succeeds in finding a proper mapping between neural
states and desired actions. Therefore, the kernel filter structure is a suitable approach
to obtain a flexible neural state decoder that can be learned and adapted online. We
also provided a methodology to tease apart the influences of the user and the agent in
the overall performance of the system. This methodology helped us visualize the cases
where the errors may have been caused by the user as well as the decision boundaries
that the decoder implements based on the observed neural states.
We saw the successful integration of the proposed TD algorithms in policy search.
This shows that the introduced TD methods have the capability to approximate value
functions properly, which can contribute to finding a proper policy. Actor-Critic is another
well known method to find a policy using an estimated value function. We can also
extend the application of the Q-CTD, Q-KTD, and Q-CKTD algorithms to the Actor-Critic
framework. The Actor-Critic method combines the advantages of policy gradient and
value function approximation with the possibility of better convergence guarantees and
reduced variance on the estimation. The TD algorithms can be applied to the Critic to
estimate the value function, and the policy gradient method can be applied to update the
Actor that chooses the action [23].
APPENDIX A
MERCER’S THEOREM
Let X be a compact subset of R^n. Suppose κ is a continuous symmetric function such that the integral operator T_κ : L2(X) → L2(X),
\[
(T_\kappa f)(\cdot) = \int_{X} \kappa(\cdot, x) f(x)\, dx,
\]
is positive, that is,
\[
\int_{X \times X} \kappa(x, z) f(x) f(z)\, dx\, dz \geq 0
\]
for all f ∈ L2(X). Then we can expand κ(x, z) in a uniformly convergent series (on X × X) in terms of functions ϕ_j satisfying ⟨ϕ_j, ϕ_i⟩_{L2(X)} = δ_ij:
\[
\kappa(x, z) = \sum_{j=1}^{\infty} \lambda_j \phi_j(x) \phi_j(z).
\]
Furthermore, the series ∑_{i=1}^{∞} |λ_i| is convergent [33].
APPENDIX B
QUANTIZATION METHOD
The quantization approach introduced in [9] is a simple yet effective approximation
heuristic that limits the growing structure of the filter by adding units in a selective
fashion. Once a new state input x(i) arrives, its distances to each existing unit C(i − 1)
are calculated
\[
\mathrm{dist}(x(i), C(i-1)) = \min_{1 \leq j \leq \mathrm{size}(C(i-1))} \left\| x(i) - C_j(i-1) \right\|. \qquad (B–1)
\]
If the minimum distance dist(x(i),C(i − 1)) is smaller than the quantization size ϵU , the
new input state x(i) is absorbed by the closest existing unit to it, and hence no new unit
is added to the structure. In this case, unit centers remain the same C(i) = C(i − 1), but
the connection weights to the closest unit are updated.
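A small sketch of this check, assuming the existing units are stored as a plain list of center vectors:

```python
import numpy as np

def quantize(centers, x_new, eps_u):
    """Quantization step from [9]: absorb x_new into the closest existing unit when the
    minimum distance (B-1) is below eps_u; otherwise add x_new as a new center.
    Returns the index of the unit the new state maps to (the weight update itself
    is handled by the learning rule)."""
    if centers:
        dists = [np.linalg.norm(np.asarray(x_new) - np.asarray(c)) for c in centers]
        j = int(np.argmin(dists))
        if dists[j] < eps_u:
            return j                       # absorbed: the set of centers is unchanged
    centers.append(np.asarray(x_new))      # new unit added to the structure
    return len(centers) - 1
```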
REFERENCES
[1] Bae, Jihye, Chhatbar, Pratik, Francis, Joseph T., Sanchez, Justin C., and Principe, Jose C. “Reinforcement Learning via Kernel Temporal Difference.” The 33rd Annual International Conference of the IEEE on Engineering in Medicine and Biology Society. 2011, 5662–5665.
[2] Bae, Jihye, Giraldo, Luis Sanchez, Chhatbar, Pratik, Francis, Joseph T., Sanchez, Justin C., and Principe, Jose C. “Stochastic Kernel Temporal Difference for Reinforcement Learning.” IEEE International Workshop on Machine Learning for Signal Processing. 2011, 1–6.
[3] Baird, Leemon. “Residual Algorithms: Reinforcement Learning with Function Approximation.” Machine Learning. 1995, 30–37.
[4] Boser, Bernhard E., Guyon, Isabelle M., and Vapnik, Vladimir N. “A Training Algorithm for Optimal Margin Classifiers.” In Proceedings of the 5th Annual Workshop on Computational Learning Theory (COLT). 1992, 144–152.
[5] Boyan, Justin A. Learning Evaluation Functions for Global Optimization. Ph.D. thesis, Carnegie Mellon University, 1998.
[6] ———. “Technical Update: Least-Squares Temporal Difference Learning.” Machine Learning 49 (2002): 233–246.
[7] Boyan, Justin A. and Moore, Andrew W. “Generalization in Reinforcement Learning: Safely Approximating the Value Function.” Advances in Neural Information Processing Systems. 1995, 369–376.
[8] Bradtke, Steven J. and Barto, Andrew G. “Linear Least-Squares Algorithms for Temporal Difference Learning.” Machine Learning 22 (1996): 33–57.
[9] Chen, Badong, Zhao, Songlin, Zhu, Pingping, and Principe, Jose C. “Quantized Kernel Least Mean Square Algorithm.” IEEE Transactions on Neural Networks and Learning Systems 23 (2012).1: 22–32.
[10] Dayan, Peter and Sejnowski, Terrence J. “TD(λ) Converges with Probability 1.” Machine Learning 14 (1994): 295–301.
[11] Deisenroth, Marc Peter. Efficient Reinforcement Learning using Gaussian Processes. Ph.D. thesis, Karlsruhe Institute of Technology, 2010.
[12] Dietterich, Thomas G. and Wang, Xin. “Batch Value Function Approximation via Support Vectors.” Advances in Neural Information Processing Systems. MIT Press, 2001, 1491–1498.
[13] DiGiovanna, Jack, Mahmoudi, Babak, Fortes, Jose, Principe, Jose C., and Sanchez, Justin C. “Coadaptive Brain-Machine Interface via Reinforcement Learning.” IEEE Transactions on Biomedical Engineering 56 (2009).1.
[14] Engel, Yaakov, Mannor, Shie, and Meir, Ron. “Reinforcement learning with Gaussian processes.” In Proceedings of the 22nd International Conference on Machine Learning. 2005, 201–208.
[15] Geramifard, Alborz, Bowling, Michael, and Sutton, Richard S. “Incremental Least-Squares Temporal Difference Learning.” In Proceedings of the 21st National Conference on Artificial Intelligence. 2006, 356–361.
[16] Geramifard, Alborz, Bowling, Michael, Zinkevich, Martin, and Sutton, Richard S. “iLSTD: Eligibility Traces and Convergence Analysis.” Advances in Neural Information Processing Systems. 2007, 441–448.
[17] Ghavamzadeh, Mohammad and Engel, Yaakov. “Bayesian Actor-Critic Algorithms.” In Proceedings of the 24th International Conference on Machine Learning. 2007.
[18] Gunduz, Aysegul and Principe, Jose C. “Correntropy as a Novel Measure for Nonlinearity Tests.” International Joint Conference on Neural Networks 89 (2009).
[19] Haykin, Simon. Neural Networks: A Comprehensive Foundation. Maxwell, 1994.
[20] ———. Neural Networks and Learning Machines. Prentice Hall, 2009.
[21] Jeong, Kyu-Hwa and Principe, Jose C. “The Correntropy MACE Filter for Image Recognition.” In Proceedings of the 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing. 2006, 9–14.
[22] Kim, Sung-Phil, Sanchez, Justin C., Rao, Yadunandana N., Erdogmus, Deniz, Carmena, Jose M., Lebedev, Mikhail A., Nicolelis, Miguel A. L., and Principe, Jose C. “A Comparison of Optimal MIMO Linear and Nonlinear Models for Brain-Machine Interfaces.” Journal of Neural Engineering 3 (2006).145.
[23] Konda, Vijay R. and Tsitsiklis, John N. “On Actor-Critic Algorithms.” Society for Industrial and Applied Mathematics Journal on Control and Optimization 42 (2003).4: 1143–1166.
[24] Kushner, Harold J. and Clark, Dean S. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978.
[25] Liu, Weifeng, Park, Il, and Principe, Jose C. “An Information Theoretic Approach of Designing Sparse Kernel Adaptive Filters.” IEEE Transactions on Neural Networks 20 (2009).12: 1950–1961.
[26] Liu, Weifeng, Pokharel, Puskal P., and Principe, Jose C. “Correntropy: Properties and Applications in Non-Gaussian Signal Processing.” IEEE Transactions on Signal Processing 55 (2007).11: 5286–5298.
[27] ———. “The Kernel Least Mean Square Algorithm.” IEEE Transactions on Signal Processing 56 (2008).2: 543–554.
[28] Liu, Weifeng, Principe, Jose C., and Haykin, Simon. Kernel Adaptive Filtering: A Comprehensive Introduction. Wiley, 2010.
[29] Maei, Hamid Reza, Szepesvari, Csaba, Bhatnagar, Shalabh, and Sutton, Richard S. “Toward Off-Policy Learning Control with Function Approximation.” Proceedings of the 27th International Conference on Machine Learning. 2010.
[30] Mahmoudi, Babak. Integrating Robotic Action with Biologic Perception: A Brain Machine Symbiosis Theory. Ph.D. thesis, University of Florida, 2010.
[31] Mahmoudi, Babak, DiGiovanna, Jack, Principe, Jose C., and Sanchez, Justin C. “Co-Adaptive Learning in Brain-Machine Interfaces.” Brain Inspired Cognitive Systems (BICS). 2008.
[32] Melo, Francisco S., Meyn, Sean P., and Ribeiro, M. Isabel. “An Analysis of Reinforcement Learning with Function Approximation.” In Proceedings of the 25th International Conference on Machine Learning. 2008, 664–671.
[33] Mercer, John. “Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations.” Philosophical Transactions of the Royal Society of London 209 (1909): 415–446.
[34] Moore, Andrew W. “Variable Resolution Dynamic Programming: Efficiently Learning Action Maps in Multivariate Real-valued State-spaces.” In Proceedings of the 8th International Conference on Machine Learning. 1991.
[35] Mulliken, Grant H., Musallam, Sam, and Andersen, Richard A. “Decoding Trajectories from Posterior Parietal Cortex Ensembles.” The Journal of Neuroscience 28 (2008).48: 12913–12926.
[36] Park, Il and Principe, Jose C. “Correntropy Based Granger Causality.” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2008, 3605–3608.
[37] Pohlmeyer, Eric A., Mahmoudi, Babak, Geng, Shijia, Prins, Noe, and Sanchez, Justin C. “Brain-machine interface control of a robot arm using actor-critic reinforcement learning.” Annual International Conference of the IEEE on Engineering in Medicine and Biology Society (EMBC). 2012, 4108–4111.
[38] Principe, Jose C. Information Theoretic Learning. Springer, 2010.
[39] Rasmussen, Carl Edward and Kuss, Malte. “Gaussian Processes in Reinforcement Learning.” Advances in Neural Information Processing Systems. MIT Press, 2004, 751–759.
[40] Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
[41] Sanchez, Justin C., Tarigoppula, Aditya, Choi, John S., Marsh, Brandi T., Chhatbar, Pratik Y., Mahmoudi, Babak, and Francis, Joseph T. “Control of a center-out reaching task using a reinforcement learning Brain-Machine Interface.” The 5th International IEEE/EMBS Conference on Neural Engineering (NER). 2011, 525–528.
[42] Santamaria, Ignacio, Pokharel, Puskal P., and Principe, Jose C. “Generalized Correlation Function: Definition, Properties, and Application to Blind Equalization.” IEEE Transactions on Signal Processing 54 (2006).6.
[43] Saunders, Craig, Gammerman, Alexander, and Vovk, Volodya. “Ridge Regression Learning Algorithm in Dual Variables.” In Proceedings of the 15th International Conference on Machine Learning. 1998, 515–521.
[44] Scholkopf, Bernhard and Smola, Alexander J. Learning with Kernels. MIT Press, 2002.
[45] Singh, Abhishek and Principe, Jose C. “Using Correntropy as a cost function in linear adaptive filters.” The 2009 International Joint Conference on Neural Networks (IJCNN). 2009, 2950–2955.
[46] ———. “A Closed Form Recursive Solution for Maximum Correntropy Training.” 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2010, 2070–2073.
[47] ———. “A loss function for classification based on a robust similarity metric.” The 2010 International Joint Conference on Neural Networks (IJCNN). 2010, 1–6.
[48] Singh, Satinder P. and Sutton, Richard S. “Reinforcement Learning with Replacing Eligibility Traces.” Machine Learning 22 (1996): 123–158.
[49] Sussillo, David, Nuyujukian, Paul, Fan, Joline M., Kao, Jonathan C., Stavisky, Sergey D., Ryu, Stephen, and Shenoy, Krishna. “A recurrent neural network for closed-loop intracortical brain-machine interface decoders.” Journal of Neural Engineering 9 (2012).2.
[50] Sutton, Richard S. “Learning to Predict by the Methods of Temporal Differences.” Machine Learning 3 (1988): 9–44.
[51] ———. “Open Theoretical Questions in Reinforcement Learning.” Tech. rep., AT&T Labs, 1999.
[52] Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, 1998.
[53] Szepesvari, Csaba. Algorithms for Reinforcement Learning. Morgan & Claypool, 2010.
[54] Tsitsiklis, John N. and Van Roy, Benjamin. “An Analysis of Temporal-Difference Learning with Function Approximation.” IEEE Transactions on Automatic Control 42 (1997).5: 674–690.
[55] Watkins, Christopher J. C. H. Learning from Delayed Rewards. Ph.D. thesis, King’s College, 1989.
[56] Watkins, Christopher J. C. H. and Dayan, Peter. “Technical Note: Q-Learning.” Machine Learning 8 (1992).3-4: 279–292.
[57] Xu, Xin, Hu, Dewen, and Lu, Xicheng. “Kernel-Based Least Squares Policy Iteration for Reinforcement Learning.” IEEE Transactions on Neural Networks 18 (2007).4.
[58] Xu, Xin, Xie, Tao, Hu, Dewen, and Lu, Xicheng. “Kernel Least-Squares Temporal Difference Learning.” International Journal of Information Technology. vol. 11. 2005, 54–63.
[59] Zhao, Songlin, Chen, Badong, and Principe, Jose C. “Kernel Adaptive Filtering with Maximum Correntropy Criterion.” The 2011 International Joint Conference on Neural Networks (IJCNN). 2011, 2012–2017.
BIOGRAPHICAL SKETCH
Jihye Bae received a Bachelor of Engineering in the School of Electrical Engineering
and Computer Science at Kyungpook National University, Daegu, South Korea in 2007,
and the Master of Science and Doctor of Philosophy (Ph.D.) in the Department of
Electrical and Computer Engineering at University of Florida, Gainesville, Florida, the
United States of America in 2009 and 2013, respectively. She joined the Computational
Neuro-Engineering Laboratory (CNEL) at University of Florida in 2010 during her
Ph.D. studies and worked as a research assistant under the supervision of Prof. Jose
C. Principe at CNEL. Her research interests encompass adaptive signal processing,
machine learning, and their applications in brain machine interfaces including neural
decoding and control problems. Her current research mainly focuses on kernel methods
and information theoretic learning, and how both areas can be applied in reinforcement
learning.