KERNEL TEMPORAL DIFFERENCES FOR REINFORCEMENT LEARNING WITH APPLICATIONS TO BRAIN MACHINE INTERFACES
By
JIHYE BAE
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2013
© 2013 Jihye Bae
I dedicate this to my family for their endless support.
ACKNOWLEDGMENTS
I would like to sincerely thank my Ph.D. advisor Prof. Jose C. Principe for his
invaluable guidance, understanding, and patience. It is hard to imagine that I could
complete this program without his continual support. It was my good fortune to meet
Prof. Principe. Thanks to him, I was able to obtain in-depth knowledge in adaptive signal
processing and information theoretic learning and to have an unforgettable lifetime
opportunity to enhance my view of research.
I would like to thank Prof. Justin C. Sanchez for enriching my knowledge in
neuroscience and supporting my research. His willingness and openness to collaborate
gave me chances to learn and to conduct practical experiments that have become
an important part of this dissertation. In addition, I want to thank Dr. Sanchez’s lab
members, especially Dr. Eric Pohlmeyer and Dr. Babak Mahmoudi for their help, advice,
and fruitful discussions. I would also like to thank my Ph.D. committee members, Prof.
John G. Harris, Prof. Paul D. Gader, and Prof. Arunava Banerjee for their valuable
comments and the critical feedback about my research.
I was very fortunate to have the opportunity to be part of the Computational
Neuro-Engineering Laboratory (CNEL) at the University of Florida. Thanks to the CNEL
members, I not only gained knowledge but also made memories that will remain for life.
I especially thank my lovely girls, Dr. Lin Li and Dr. Songlin Zhao, for their constant
support and wonderful friendship; I will never forget the first day at the University
of Florida and CNEL. You will always be with me in my heart. I also thank Austin
Brockmeier, Evan Kriminger, and Matthew Emigh for the valuable discussions, help,
and for introducing me to the diverse culture of the US. I thank my good old CNEL
friend, Stefan Craciun; I will always miss the energetic and exciting corner. I thank CNEL
alumni, Dr. Erion Hasanbelliu, Dr. Sohan Seth, and Dr. Alexander Singh Alvarado,
for good memories at Greenwich Green and CNEL. I also thank Dr. Divya Agrawal,
Veronica Bolon Canedo, Rakesh Chalasani, Goktug T. Cinar, Rosha Pokharel, Kwansun
Cho, Jongmin Lee, Pingping Zhu, Kan Li, In Jun Park, Miguel D. Teixeira, Bilal Fadlallah,
Gavin Philips, and Gabriel Nallathambi for their support and good friendship.
Last but not least, I want to thank my family for their encouragement and endless
support including my new family, Dr. Luis Gonzalo Sanchez Giraldo. You are my best
friend, classmate, lab mate, and lifetime partner. You brought me happiness and faith
and enriched my life both as a researcher and as a person.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 REINFORCEMENT LEARNING . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 STATE VALUE FUNCTION ESTIMATION/ POLICY EVALUATION . . . . . . . . 32
3.1 Temporal Difference(λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
    3.1.1 Temporal Difference(λ) in Reinforcement Learning . . . . . . . . . . 36
    3.1.2 Convergence of Temporal Difference(λ) . . . . . . . . . . . . . . . . 37
3.2 Kernel Temporal Difference(λ) . . . . . . . . . . . . . . . . . . . . . . . . . 41
    3.2.1 Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
    3.2.2 Kernel Temporal Difference(λ) . . . . . . . . . . . . . . . . . . . . . 42
    3.2.3 Convergence of Kernel Temporal Difference(λ) . . . . . . . . . . . . 44
3.3 Correntropy Temporal Differences . . . . . . . . . . . . . . . . . . . . . . . 46
    3.3.1 Correntropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
    3.3.2 Maximum Correntropy Criterion . . . . . . . . . . . . . . . . . . . . 48
    3.3.3 Correntropy Temporal Difference . . . . . . . . . . . . . . . . . . . . 49
    3.3.4 Correntropy Kernel Temporal Difference . . . . . . . . . . . . . . . . 50
4 SIMULATIONS - POLICY EVALUATION . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Linear Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Linear Case - Robustness Assessment . . . . . . . . . . . . . . . . . . . . 57
4.3 Nonlinear Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 Nonlinear Case - Robustness Assessment . . . . . . . . . . . . . . . . . . 69
5 POLICY IMPROVEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.1 State-Action-Reward-State-Action . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Q-learning via Kernel Temporal Differences and Correntropy Variants . . . 77
5.4 Reinforcement Learning Brain Machine Interface Based on Q-learning
    with Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6 SIMULATIONS - POLICY IMPROVEMENT . . . . . . . . . . . . . . . . . . . . 82
6.1 Mountain Car Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2 Two Dimensional Spatial Navigation Task . . . . . . . . . . . . . . . . . . . 88
7 PRACTICAL IMPLEMENTATIONS . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.1 Open Loop Reinforcement Learning Brain Machine Interface: Q-KTD(λ) . . 95
    7.1.1 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
    7.1.2 Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
    7.1.3 Center-out Reaching Task - Single Step . . . . . . . . . . . . . . . . 97
    7.1.4 Center-out Reaching Task - Multi-Step . . . . . . . . . . . . . . . . . 102
7.2 Open Loop Reinforcement Learning Brain Machine Interface: Q-CKTD . . . 104
7.3 Closed Loop Brain Machine Interface Reinforcement Learning . . . . . . . 107
    7.3.1 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
    7.3.2 Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
    7.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
    7.3.4 Closed Loop Performance Analysis . . . . . . . . . . . . . . . . . . . 113
8 CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . 118
APPENDIX
A MERCER’S THEOREM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
B QUANTIZATION METHOD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
LIST OF TABLES
Table page
6-1 The average success rate of Q-KTD and Q-CKTD. . . . . . . . . . . . . . . . . 93
LIST OF FIGURES
Figure page
1-1 The decoding structure of the reinforcement learning model in a brain machine interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2-1 The agent and environment interaction in reinforcement learning. . . . . . . . . 28
3-1 Diagram of adaptive value function estimation in reinforcement learning. . . . . 32
3-2 Contours of CIM(X , 0) in 2 dimensional sample space. . . . . . . . . . . . . . . 48
4-1 A 13 state Markov chain [6] for the linear case. . . . . . . . . . . . . . . . . . . 54
4-2 Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η0 in TD(λ). . . . . . . . . . . . . . . . . . . . . . . . . . 55
4-3 Performance over different kernel sizes in KTD(λ). . . . . . . . . . . . . . . . . 56
4-4 Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η0 in KTD(λ) with h = 0.2. . . . . . . . . . . . . . . . . . 56
4-5 Learning curve of TD(λ) and KTD(λ). . . . . . . . . . . . . . . . . . . . . . . . . 58
4-6 The comparison of state value V(x), x ∈ X, convergence between TD(λ) and KTD(λ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4-7 The performance of TD for different levels (variances σ2) of additive Gaussiannoise on the rewards. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4-8 The performance change of CTD over different correntropy kernel sizes, hc . . . 61
4-9 Learning curve of TD and CTD when Gaussian noise with variance σ² = 10 is added to the reward. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4-10 Performance of CTD corresponding to different correntropy kernel sizes hc, with a mixture of Gaussians noise distribution. . . . . . . . . . . . . . . . . . . 62
4-11 Learning curves of TD and CTD when the noise added to the rewards corresponds to a mixture of Gaussians. . . . . . . . . . . . . . . . . . . . . . . . . . 62
4-12 Performance changes of TD with respect to different Laplacian noise variances b². . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4-13 Performance of CTD depending on different correntropy kernel sizes hc withvarious Laplacian noise variances. . . . . . . . . . . . . . . . . . . . . . . . . . 64
4-14 Learning curve of TD and CTD when Laplacian noise with variance b² = 25 is added to the reward. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4-15 A 13 state Markov chain for the nonlinear case. . . . . . . . . . . . . . . . . . . 66
4-16 The effect of λ and the initial step size η0 in TD(λ). . . . . . . . . . . . . . . . . 66
4-17 The performance of KTD with different kernel sizes. . . . . . . . . . . . . . . . 67
4-18 Performance comparison over different combinations of λ and the initial step size η in KTD(λ) with h = 0.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4-19 Learning curves of TD(λ) and KTD(λ). . . . . . . . . . . . . . . . . . . . . . . . 68
4-20 The comparison of state value convergence between TD(λ) and KTD(λ). . . . 69
4-21 Performances of CKTD depending on the different correntropy kernel sizes. . . 70
4-22 Learning curve of KTD and CKTD. . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-23 The comparison of the state value function Ṽ estimated by KTD and Correntropy KTD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-24 Mean and standard deviation of RMS error over 100 runs at the 2000th trial. . . 73
4-25 Mean RMS error over 100 runs; note that the horizontal axis is on a log scale. 73
5-1 The structure of Q-learning via kernel temporal difference (λ) . . . . . . . . . . 79
5-2 The decoding structure of the reinforcement learning model in a brain machine interface using a Q-learning based function approximation algorithm. . . . . . 80
6-1 The Mountain-car task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6-2 Performance of Q-TD(λ) with various combinations of λ and η. . . . . . . . . 84
6-3 The performance of Q-KTD(λ) with respect to different kernel sizes. . . . . . . 85
6-4 Performance of Q-KTD(λ) with various combinations of λ and η. . . . . . . . 85
6-5 Relative frequency with respect to average number of iterations per trial of Q-TD(λ) and Q-KTD(λ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6-6 Average number of iterations per trial of Q-TD(λ) and Q-KTD(λ). . . . . . . . . 86
6-7 The performance of Q-CKTD with different correntropy kernel sizes. . . . . . . 88
6-8 Average number of steps per trial of Q-KTD and Q-CKTD. . . . . . . . . . . . . 88
6-9 The average success rates over 125 trials and 50 implementations. . . . . . . . 90
6-10 The average final filter sizes over 125 trials and 50 implementations. . . . . . . 91
6-11 Two dimensional state transitions of the first, third, and fifth sets with η = 0.9 and λ = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6-12 The average success rates over 125 trials and 50 implementations with respect to different filter sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6-13 The change of success rates (top) and final filter size (bottom) with ϵU = 5. . . 93
6-14 The change of average success rates by Q-KTD and Q-CKTD. . . . . . . . . . 94
7-1 The center-out reaching task for 8 targets. . . . . . . . . . . . . . . . . . . . . . 96
7-2 The comparison of average learning curves from 50 Monte Carlo runs between Q-KTD(0) and MLP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7-3 The average success rates over 20 epochs and 50 Monte Carlo runs with respect to different filter sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7-4 The comparison of KTD(0) with different final filter sizes and TDNN with 10 hidden units. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7-5 The effect of filter size control on 8-target single-step center-out reaching task. 101
7-6 The average success rates for various filter sizes. . . . . . . . . . . . . . . . . . 102
7-7 Reward distribution for right target. . . . . . . . . . . . . . . . . . . . . . . . . . 103
7-8 The learning curves for multi step multi target tasks. . . . . . . . . . . . . . . . 104
7-9 Average success rates over 50 runs. . . . . . . . . . . . . . . . . . . . . . . . . 106
7-10 Q-value changes per trial during 10 epochs. . . . . . . . . . . . . . . . . . . . 107
7-11 Target index and matching Q-values. . . . . . . . . . . . . . . . . . . . . . . . . 108
7-12 The success rates of each target over 1 through 5 epochs. . . . . . . . . . . . 109
7-13 Performance of Q-learning via KTD in the closed loop RLBMI . . . . . . . . . . 112
7-14 Proposed visualization method. . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7-15 The estimated Q-values and resulting policy for the projected neural states. . . 116
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
KERNEL TEMPORAL DIFFERENCES FOR REINFORCEMENT LEARNING WITH APPLICATIONS TO BRAIN MACHINE INTERFACES
By
Jihye Bae
August 2013
Chair: Jose C. Principe
Major: Electrical and Computer Engineering
Reinforcement learning brain machine interfaces (RLBMI) have been shown to be
a promising avenue for practical implementations of BMIs. In the RLBMI, a computer
agent and a user in the environment cooperate and learn co-adaptively. An essential
component in the agent is the neural decoder which translates the neural states of
the user into control actions for the external device in the environment. However, to
realize the advantages of the RLBMI in practice, there are several challenges that need
to be addressed. First, the neural decoder must be able to handle high dimensional
neural states containing spatial-temporal information. Second, the mapping from neural
states to actions must be flexible enough without making strong assumptions. Third,
the computational complexity of the decoder should be reasonable such that real time
implementations are feasible. Fourth, it should be robust in the presence of outliers or
perturbations in the environment. We introduce algorithms that take into account these
four issues.
To efficiently handle the high dimensional state spaces, we adopt the temporal
difference (TD) learning which allows the learning of the state value function using
function approximation. For a flexible decoder, we propose the use of kernel-based
representations, which provide nonlinear extensions of TD(λ) that we call kernel
temporal difference (KTD)(λ). Two key advantages of KTD(λ) are its nonlinear functional
approximation capabilities and convergence guarantees that gracefully emerge as
an extension of the convergence results known for linear TD learning algorithms. To
address the robustness issue, we introduce correntropy temporal difference (CTD) and
correntropy kernel temporal difference (CKTD), which are robust alternatives to the mean
square error (MSE) criterion employed by conventional TD learning.
From state value function estimation, all fundamental features of the proposed
algorithms can be observed. However, this is only an intermediate step in finding a
proper policy. Therefore, we extend all proposed TD algorithms to state-action value
function estimation based on Q-learning: Q-learning via correntropy temporal difference
(Q-CTD), Q-KTD(λ), and Q-CKTD. To illustrate the behavior of the proposed algorithms,
we apply them to the problem of finding an optimal policy on simulated sequential
decision making with continuous state spaces. The results show that Q-KTD and
Q-CKTD are able to find a proper control policy and give stable performance with the
appropriate parameters, and that Q-CKTD improves performance in off-policy learning.
Finally, the Q-KTD(λ) and Q-CKTD algorithms are applied to neural decoding in
RLBMIs. First, they are applied in open-loop experiments to find a proper mapping
between a monkey’s neural states and desired positions of a computer cursor or a
robotic arm. The experimental results show that the algorithms can effectively learn
the neural state-action mapping. Moreover, Q-CKTD shows that the optimal policy
can be estimated even without having perfect predictions of the value function with
a discrete set of actions. Q-KTD is also applied to closed-loop RLBMI experiments.
The co-adaptation of the decoder and the subject are observed. Results show that the
algorithm succeeds in finding a proper mapping between neural states and desired
actions. The kernel based representation combined with temporal differences is a
suitable approach to obtain a flexible neural state decoder that can be learned and
adapted online. These observations show the algorithms’ potential advantages in
relevant practical applications of RL.
CHAPTER 1
INTRODUCTION
Research in brain machine interfaces (BMIs) is a multidisciplinary effort involving
fields such as neurophysiology and engineering. Developments in this area have a wide
range of applications, especially for subjects with neuromuscular disabilities, for whom
BMIs may become a significant aid. Neural decoding of motor signals is one of the main
tasks that needs to be executed by the BMI.
Neural decoding is a process of extracting information from brain signals. For
example, we can reconstruct a stimulus based on the spike trains produced by certain
neurons in the brain. The main goal of neural decoding is to characterize the electrical
activity of groups of neurons, that is, identifying patterns of behavior that correlate with a
given task. This process is a fundamental step towards the design of prosthetic devices
that communicate directly with the brain.
Ideas from system theory can be used to frame the decoding problem. Bypassing
the body can be achieved by modelling the transfer function from brain activity to limb
movement and utilizing the output of the properly trained model to control a robotic
device to implement the intention of movement.
Some approaches to the design of neural decoding systems involve machine
learning methods. In order to choose the appropriate learning method, factors such as
learning speed and stability help in determining the usefulness of a particular method.
Supervised learning is commonly applied in BMI [19] because of the tremendous
body of work in system identification. Given a training set of neural signals and
synchronized movements, the problem is to find a mapping between the two which
can be solved by applying supervised learning techniques; the kinematic variables
of an external device are set as desired signals, and the system can be trained to
obtain the regression model. [22] showed that well-known supervised learning
algorithms such as the Wiener filter, the least mean square adaptive filter, and the time delay neural
network are able to estimate the mapping from spike trains from the motor cortex to
the kinematic variables of a monkey’s hand movements. [35] applied linear estimation
algorithms including ridge regression and a modified Kalman filter to estimate the cursor
position on a computer screen based on a monkey’s neural activity; the system was
also implemented for closed-loop brain control experiments. In addition, [49] used an
echo state network, which is a type of recurrent neural network, to decode a monkey's
neural activity in a center-out reach task in closed loop BMIs. Note that when closed
loop BMI experiments are conducted using supervised learning, a pre-trained functional
regression model is applied to estimate the desired kinematic values; after pre-training,
fixed model parameters are applied, and the system does not adapt simultaneously
during the experiments.
Even though the supervised learning approach has been applied to neural decoding
in real time control of BMIs, it is probably not the most appropriate methodology for the
problem because of the absence of ground truth in a paraplegic user who cannot move.
In addition, even if the desired signal were available, there are other factors such as brain
plasticity that still limit the functionality of supervised learning since frequent calibration
(retraining) becomes necessary. In BMIs, it is necessary to have direct communication
between the central nervous system and the computer that controls an external device
such as a prosthetic arm for disabled individuals. Thus, methods that can adapt and
adjust to subtle neural variations are preferred.
When we frame neural decoding as a sequential decision making problem, dynamic
programming (DP) is a classical approach to solve such problems. In sequential
decision making problems, there is a dynamic system whose evolution is affected by
the decisions being made. The goal is to find a decision making rule (feedback policy)
that optimizes a given performance criterion. However, DP has the following drawbacks:
it assumes that all model components, including the dynamics and the environment, are
known, and that all states are fully observable. In many practical applications, the
above conditions are often unsatisfied. In addition, to find an optimal decision maker, DP
requires the evaluation of all the states and controls. This results in high computational
demands when the problem dimension scales up (Bellman’s curse of dimensionality).
Furthermore, direct modelling is rather difficult since there are many factors that need
to be accounted for even within the same task and subject. Although the theoretical
foundation of reinforcement learning (RL) is drawn from dynamic programming (DP),
RL addresses the drawbacks of dynamic programming because it allows us to achieve
an approximation of the optimal value functions of DP without explicitly knowing the
environment.
On the other hand, RL is one of the representative learning schemes (along with supervised
and unsupervised learning) and provides a general framework for adapting a system
to a novel environment. RL differs from the other learning schemes in the sense that
RL not only observes but also interacts with the environment to collect the information.
Also, RL receives reward information from the environment which is frequently delayed
by unspecified time amounts. Thus, RL is considered the most realistic class of learning
and is rich with many algorithms for on-line learning with low computational complexity.
Reinforcement learning (RL) algorithms are a general framework for system adaptation
to novel environments; this characteristic is similar to the way biological organisms
interact with environment and learn from experience. In RL, it is possible to learn
only with information from the environment, and thus the need for a desired signal
is suppressed. Therefore, RL is well suited for the neural decoding stage of a BMI
application.
A BMI architecture based on reinforcement learning (RLBMI) is introduced in [13],
and successful applications of this approach can be found in [1, 30, 37]. In the RLBMI
architecture, there are two intelligent systems: the BMI decoder in the agent, and the
user in the environment. The two intelligent systems learn co-adaptively based on
closed loop feedback (Figure 1-1). The agent updates the state of the environment,
namely, the position of a cursor on a screen or a robot’s arm position, based on the
user’s neural activity and the received rewards. At the same time, the subject produces
the corresponding brain activity. Through iterations, both systems learn how to earn
rewards based on their joint behavior. The BMI decoder learns a control strategy based
on the user’s neural state and performs actions in goal directed tests that update the
state of the external device in the environment. In addition, the user learns the task
based on the state of the external device. Notice that both systems act symbiotically
by sharing the external device to complete their tasks, and this co-adaptation allows
for continuous synergistic adaptation between the BMI decoder and the user even in
changing environments.
Figure 1-1. The decoding structure of the reinforcement learning model in a brain machine interface.
Note that in the agent, the proper neural decoding of the motor signals is essential
to control the external device that interacts with the physical environment. However,
there are several challenges that must be addressed in practical implementations of
RLBMI:
1. High dimensional input state spaces
Algorithms must be able to readily handle high dimensional state spaces that correspond to the neural state representations.
2. Nonlinear mappings
The mapping from neural states to actions must be flexible enough to handle nonlinear mappings while making few assumptions.
3. Computational complexity
Algorithms should execute with a reasonable amount of time and resources that allow them to perform control actions in real time.
4. Robustness
The algorithms should handle cases where assumptions may not hold, e.g., the presence of outliers or perturbations in the environment.
In this dissertation, we introduce algorithms that take into account the aforementioned
issues.
RL learns optimal control policies (a map from states to actions) by observing
the interaction of a learning agent with the environment. At each step, the decision
maker decides an action given a state from a system (environment) to generate
desirable states. Over time, the controller (agent) learns by interacting with the system
(environment) while maximizing a quantity known as total reward. The aim of learning is
to derive the optimal control policies to bring the desired behavior into the system, and
the optimality is assessed in terms of the expected total reward known as value function.
Therefore, estimating the value function is a fundamental and crucial algorithmic
component in reinforcement learning problems.
Temporal difference (TD) learning is a method that can be applied to approximate
value functions through incremental computation directly from new experience without
having an associated model of environment. This allows us to efficiently handle high
dimensional states and actions by using adaptive functional approximators, which can
be trained directly from the data. TD algorithms approximate the value function based
on the difference between two estimations corresponding to subsequent inputs in time
(temporal difference error).
The introduction of the TD(λ) algorithm in [50] revived interest in TD learning
in the RL community. Here, λ represents an eligibility trace rate, which is added to the
averaging process over temporal differences to put emphasis on the most recently
observed states and to deal efficiently with delayed rewards. TD(λ) [50] is the
fundamental algorithm used to estimate state value functions, which can be utilized to
compute an approximate solution to Bellman’s equation using parametrized functions.
Because TD learning allows system updates directly from the sequence of states, online
learning becomes possible without having a desired signal at all times. For a majority
of real world prediction problems, TD learning has lower memory and computational
demands than supervised learning [50].
Since TD(λ) updates the value function whenever any state transition is
observed, it may make inefficient use of data. In addition, the manual selection
of optimal parameters (stepsize and the eligibility trace rate) is still required. A poor
choice of the stepsize and the eligibility trace parameters can cause a dramatically slow
convergence rate or an unstable system. TD(λ) is also sensitive to the distance between
optimal and initial parameters. However, it is popularly applied because of its simplicity
and ability to be used for online learning in multistep prediction problems.
To avoid the possibility of poor performance due to improper choice of the stepsize
and the initialization of parameters in TD(λ), the least squares TD (LSTD) and recursive
least squares TD (RLSTD) were introduced in [8]. Subsequently, an extension to
arbitrary values of λ, LSTD(λ), was proposed in [5]. However, in comparison to TD
(O(d)), LSTD and RLSTD have increased computational complexity per update: O(d³)
and O(d²) respectively, where d is the dimensionality of the state representation space.
The necessity of addressing computational efficiency has stimulated further interest
in online learning. Incremental least squares TD learning called iLSTD, which achieves
per-time-step complexities of O(d), was introduced in [15], and its theoretical analysis
extended to iLSTD(λ) can be found in [16]. This iLSTD uses a similar approach to
RLSTD, but to update the system and keep a low computational load, it only modifies
the single weight dimension that corresponds to the largest TD update. However,
theoretical analysis shows that convergence cannot be guaranteed under this greedy
approach, and modifications that guarantee convergence increase the computational
cost dramatically. This makes the above algorithm unattractive for online learning.
Even though all the above methods provide their own advantages such as
convergence, stability, or learning rate, they are limited to parametrized linear function
approximation which may not be as flexible especially in practical applications where
little prior knowledge can be incorporated. The importance of finding a proper functional
space turns our interest towards nonlinear models which are generally more flexible.
Nonlinear variants of TD algorithms have also been proposed. However, they are mostly
based on time delay neural networks, sigmoidal multilayer perceptrons, or radial basis
function networks. Despite their good approximation capabilities, these algorithms are
usually prone to falling into local minima [3, 7, 20, 54], turning training into an art.
There has been a growing interest in a class of learning algorithms that have
nonlinear approximation capabilities and yet allow cost functions that are convex. They
are known as kernel based learning algorithms [44]. One of the major appeals of kernel
methods is the ability to handle nonlinear operations on the data by indirectly using an
underlying nonlinear mapping to a so-called feature space (Reproducing Kernel Hilbert
Space (RKHS)) which is endowed with an inner product. A linear operation in the RKHS
corresponds to a nonlinear operation in the input space; for some kernel functions these
properties can lead to universal approximation of functions on the input space. Many
of the related optimization problems can be posed as convex (no local minima) with
algorithms that are still reasonably easy to compute (using the kernel trick [44]).
Recent work in adaptive filtering has shown the usefulness of kernel methods in
solving nonlinear adaptive filtering problems [20, 28]. Successful applications of the
kernel-based approach in supervised learning are well known through support vector
machines (SVM) [4], kernel least squares (KLS) [43], and Gaussian processes (GP)
[40]. Kernel-based learning has also been successfully integrated into reinforcement
learning [11, 12, 14, 17, 39, 57] demonstrating their potential advantages in this context.
Furthermore, kernel methods have been integrated with temporal difference
algorithms showing superior performance in nonlinear approximation problems. The
close relation between Gaussian processes and kernel recursive least squares was
exploited in [14] to bring the Bayesian framework into TD learning. Gaussian process
temporal difference uses kernels in probabilistic discriminative models based on
Gaussian processes, incorporating parameters such as variance of the observation
noise and providing predictive distributions (posterior variance) to evaluate predictions.
Similar work using kernel-based least squares temporal difference learning with
eligibilities called KLSTD(λ) was introduced in [58]. Unlike GPTD, KLSTD(λ) does
not use a probabilistic approach. The idea in KLSTD is to extend LSTD(λ) [5] using
the concept of duality. However, KLSTD(λ) uses a batch update, so its computational
complexity per time update is O(n³), which is not practical for online learning.
Here, we will investigate TD learning integrated with kernel methods, which we call
kernel temporal difference (KTD) (λ) [1, 2]. We adopt a learning approach based on
stochastic gradient methods, which is very popular in adaptive filtering. When combined
with kernel methods, the stochastic gradient can reduce the computational complexity
to O(n). Namely, we show how KTD(λ) can be derived from the kernel least mean
square (KLMS) [27] algorithm. Although the standard setting in supervised learning
differs from RL, since RL does not use explicit information from a desired signal at every
sample, elements such as the adaptive gain and the approximation error terms can be
well exploited in solving RL problems [50]. KTD shares many features with the KLMS
algorithm [27] except that the error is now obtained using the temporal differences,
i.e. the difference of consecutive outputs is used as the error guiding the adaptation
process.
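To make the connection concrete, the following is a rough Python sketch of the idea in its simplest form: a KTD(0)-style value estimator built from Gaussian-kernel, KLMS-type updates, where the TD error plays the role of the supervised error. The kernel width, step size, and the unbounded growth of the kernel expansion are illustrative simplifications, not the exact KTD(λ) algorithm developed in Chapter 3.

import numpy as np

class KTDZeroSketch:
    """Kernel value-function estimator updated with the TD error (KTD(0) flavor)."""

    def __init__(self, kernel_width=1.0, eta=0.1, gamma=0.9):
        self.h, self.eta, self.gamma = kernel_width, eta, gamma
        self.centers, self.coeffs = [], []   # growing kernel expansion

    def _kernel(self, x, c):
        # Gaussian (universal) kernel between a state x and a stored center c
        return np.exp(-np.sum((x - c) ** 2) / (2.0 * self.h ** 2))

    def value(self, x):
        return sum(a * self._kernel(x, c) for a, c in zip(self.coeffs, self.centers))

    def update(self, x, reward, x_next):
        # As in KLMS, each sample adds a kernel unit; the TD error replaces
        # the supervised error as the coefficient of the new unit.
        td_error = reward + self.gamma * self.value(x_next) - self.value(x)
        self.centers.append(np.asarray(x, dtype=float))
        self.coeffs.append(self.eta * td_error)
        return td_error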
Online KTD(λ) is well suited for nonlinear function approximation. It avoids some
of the main issues such as local minima or proper initialization that are common
in other nonlinear function approximation methods. In addition, based on the dual
representation, we can show other implicit advantages of using kernels. For instance,
universal kernels automatically satisfy one of the conditions for convergence of TD(λ).
Namely, linearly independent representations of states are obtained through the implicit
mapping associated with the kernel.
Even though this non-parametric technique requires a high computational cost that
comes with the inherently growing structure, when the problem is highly complicated
and requires a large amount of data, these techniques produce better solutions than
simple linear function approximation methods. In addition, as we will see in
this work, there are methods that we can employ to overcome scalability issues such as
growing filter sizes [9, 25].
In practice, it is common to face the situation where assumptions about noise or the
model deviate from standard considerations. For example, outliers, which result from
unexpected perturbations such as noisy state representations, transitions, or rewards,
can be difficult to be accounted for. In such cases, the controller may fail to obtain the
desired behavior. To the best of our knowledge, no study has addressed the issue of
how noise or small perturbations to the model affect performance in TD learning. Most
studies on TD algorithms focus on synthetic experiments such as simulated Markov
chains or random walk problems.
In our work, we investigate the maximum correntropy criterion (MCC) as an
objective function [38] that aims at coping with the above-mentioned difficulty. Correntropy
is a generalized correlation measure between two random variables first introduced in
[42]. It has been shown that correntropy is useful in non-Gaussian signal processing
[26] and effective for many applications under noisy environments [18, 21, 36, 45, 47].
Correntropy can be applied as a cost function, resulting in the maximum correntropy
criterion (MCC). A system can be adapted in such a way that the similarity between
desired and predicted signals is maximized. MCC serves as an alternative to MSE that
uses higher order information, which makes it applicable to cases where Gaussianity
and linearity assumptions do not necessarily hold.
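For illustration only, a stochastic-gradient MCC update for a simple linear filter is sketched below: the usual LMS correction is weighted by a Gaussian of the error, so large (outlier) errors contribute very little to the update. The step size and the kernel size hc are illustrative values, and constant factors are absorbed into the step size.

import numpy as np

def mcc_lms_update(w, x, d, eta=0.1, hc=1.0):
    """One maximum correntropy criterion (MCC) update of a linear filter w.

    Gradient ascent on the Gaussian correntropy between the desired and
    predicted outputs; equivalent to LMS with an error-dependent gain."""
    e = d - w @ x                              # prediction error
    gain = np.exp(-e ** 2 / (2.0 * hc ** 2))   # downweights outlier errors
    return w + eta * gain * e * x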
MCC has been applied to obtain robust methods for adaptive systems in supervised
learning [45, 46, 59]. In particular, an interesting blend between KLMS and maximum
correntropy criterion (MCC) was proposed in [59]. The basic idea of kernel maximum
correntropy (KMC) is that the input data is mapped into an RKHS using a nonlinear
mapping function, and the maximum correntropy criterion (MCC) is applied as the cost
function guiding adaptation. It was shown that KMC accurately approximates
nonlinear systems, and it was able to reduce the detrimental effects of various types of
noise in comparison to the conventional MSE criterion.
We will show how KMC can be incorporated into TD learning. Correntropy kernel
temporal difference (CKTD) can be derived in a similar way to TD learning when posed
as a supervised learning problem. As a result, we obtain a correntropy temporal
difference (CTD) algorithm, which extends TD and KTD algorithms to the robust
maximum correntropy criterion.
Note that the TD algorithms we have studied are introduced for state value
function estimation given a fixed policy. To solve complete RL problems, the algorithms
should allow the construction of near optimal policies. We want to find the optimal
state-action mapping (policy) by maximizing a cumulative reward, and this mapping
can be exclusively determined by the estimated state-action value function because it
quantifies the relative desirability of different state and action pairs.
Actor-Critic is one way to find an optimal policy based on the estimated action value
function. This is a well-known method that combines the advantages of policy gradient
and value function approximation. The Actor-Critic method contains two separate
systems (actor and critic), and each one of the systems is updated based on the other.
The actor controls the policy to select actions, and the critic estimates the value function.
Thus, after each action is selected from the given policy by the actor, the critic evaluates
the policy using the estimated value function. In [23], it is shown how TD algorithms can
be applied to the critic to estimate the value function, while the policy gradient method
is applied to update the actor. Based on the gradient of the value function obtained
from the critic, the policy in the actor is updated. The critic evaluates the value function
given a fixed policy from the actor. However, since the Actor-Critic method includes two
systems, it is challenging to adjust them simultaneously.
On the other hand, Q-learning [55] is a simple online learning method to find an
optimal policy based on the action value function Q. Despite being a simple approach,
Q-learning is commonly used because it is effective, and the agent can be updated
based solely on observations. The basic idea of Q-learning is that when the action value
Q is close to the optimal action value Q∗, the policy, which is greedy with respect to all
action values for a given state, is close to optimal.
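For reference, the core of the standard tabular Q-learning update [55] is sketched below; the environment interface (reset/step) and the ε-greedy exploration rule are illustrative assumptions made for the example.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               eta=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning; env is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done)."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration over the current action values
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[x].argmax())
            x_next, r, done = env.step(a)
            # off-policy TD target uses the greedy (max) action value
            target = r + gamma * (0.0 if done else Q[x_next].max())
            Q[x, a] += eta * (target - Q[x, a])
            x = x_next
    return Q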
Therefore, we can think of extending the proposed TD algorithms (KTD(λ), CTD,
and CKTD) to approximate Q-functions from which we can derive the optimal policy.
In particular, we will introduce Q-CTD, Q-KTD, and Q-CKTD algorithms. Q-learning is
a well known off-policy TD control algorithm; the form of state-action mapping function
(policy) is undetermined, and TD learning is applied to estimate the state-action value
function.
The convergence of Q-learning with function approximation has been a main
concern in its application [51]. [54] showed that Q-learning can diverge when combined
with nonlinear function approximation. In addition, [3] pointed out that value based RL
algorithms can become unstable when combined with function approximation.
Despite the above issues, [32] showed convergence properties of Q-learning with
linear function approximation under restricted conditions. Furthermore, the extension
of the gradient temporal difference (GTD) family of learning algorithms to Q-learning
called Greedy GQ [29] results in better convergence properties; the system converges
independently of the sampling distribution. However, Greedy GQ may get stuck in
local minima even with linear function approximation because the objective function is
non-convex.
Although [29, 32] showed the feasibility of applying Q-learning with linear function
approximation, the use of a nonlinear function approximator in Q-learning has not
yet been actively considered mainly because of the lack of convergence guarantees.
However, incorporating the kernel-based representation may bring the advantages of
nonlinear function approximation, and the convergence properties of linear function
approximation in Q-learning would still hold. A convergence result for Q-learning using
linear function approximation by temporal difference (TD)(λ) [50] is introduced in [32].
They proved that when the learning policy and the greedy policy are close enough,
the algorithm converges to a fixed point of a recursion based on the Bellman operator
they introduced in [32]. Their convergence result is based on a relation between the
autocorrelation matrices of the basis functions with respect to the learning policy and
the greedy policy. In addition, they assume a compact state space with a finite set of
bounded linearly independent basis functions. In Q-KTD, the representation space is
possibly infinite dimensional. Therefore, the direct extension of the results from [32]
would require an extended version of the ordinary differential equation (ODE) method to
a Hilbert space valued differential equation.
Since the policy is not fixed in Q-learning, the system is required to explore
the environment and learn under changing policies. The system should respond
accordingly and be able to disregard large changes that may result from exploration. To
address this problem we explore robustness through the maximum correntropy criterion
in the context of changing policies.
As we mentioned above, one of the practical objectives of our work is to apply the
proposed TD algorithms to neural decoding within the reinforcement learning brain
machine interface framework. In the RLBMI structure, the agent learns how to translate
the neural states into actions (direction) based on predefined reward values from the
environment. Since there are two intelligent systems, a BMI decoder in the agent and a
BMI user in the environment, interacting in closed loop feedback, we can understand the system as a
cooperative game. In fact, the BMI user has no direct access to actions, and the agent
must interpret the user’s brain activity correctly to facilitate the rewards [13].
Therefore, the proposed algorithms can be applied to the agent, which decodes
the neural states transforming them to the proper action directions that are in turn
executed by an external device such as a computer screen or a robotic arm. The
updated position of the actuator will influence the user’s subsequent neural states
because of the visual feedback involved in the process. That is how the two intelligent
systems learn co-adaptively and the closed loop feedback is created. In other words,
the input to the BMI decoder is the user’s neural states, which can be considered as the
user’s output. Likewise, the action directions of the external device are the decoder’s
output and because of the visual feedback they can also be considered as the input to
the user.
We will examine the capability of the Q-KTD algorithm both in open and closed
loop Reinforcement Learning Brain Machine Interfaces (RLBMI) to perform reaching
tasks. The closed loop RLBMI experiment will show how the two intelligent systems
co-adaptively learn in a real time reaching task. Note that Q-learning via KTD (Q-KTD)
is powerful in practical applications due to its nonlinear approximation capabilities.
Also, this algorithm is advantageous for real time applications since parameters can be
chosen on the fly based on the observed input states, and no normalization is required.
In addition, we will see the performance of Q-learning via correntropy KTD (Q-CKTD)
in open loop RLBMI experiments, and see how correntropy can improve performance
under changing policies.
The main contributions of this thesis are three new state value function
approximation algorithms based on temporal difference algorithms: kernel temporal
difference(λ), correntropy temporal difference, and correntropy kernel temporal
difference. The proposed algorithms are extended to find a control policy in reinforcement
learning problems based on Q-learning, and this leads to Q-learning via kernel
temporal difference (Q-KTD(λ)), Q-CTD, and Q-CKTD. Moreover, we provide a
theoretical analysis on the convergence and degree of sub-optimality of the proposed
algorithms based on the extension of existing results to the TD algorithm and its
Q-learning counterpart. Furthermore, we test the algorithms to illustrate their behavior
and overall performance both in state value function approximation and policy estimation
problems. Finally, we apply the proposed algorithms to RLBMI showing how the
developed methodology can be useful in relevant practical scenarios.
CHAPTER 2
REINFORCEMENT LEARNING
We will show the background of RL including the mathematical formulation of the
value function in Markov decision processes, and in the following chapters, we will see
how the temporal differences can be derived for value function estimation and applied to
RL algorithms.
In reinforcement learning, a controller (agent) interacts with a system (environment)
over time and modifies its behavior to improve performance. This performance is
assessed in terms of cumulative rewards, which are assigned based on a task goal. In
RL the agent tries to adjust its behavior by taking actions that will increase the reward
in the long run; these actions are directed towards the accomplishment of the task goal
(Figure 2-1).
Figure 2-1. The agent and environment interaction in reinforcement learning.
Assuming the environment is a stochastic (that is, if a certain state is visited
different times and the same action is taken, the following state may not be the same
each time) and stationary process that satisfies the Markov condition
P(x(n)|x(n − 1), x(n − 2), · · · , x(0)) = P(x(n)|x(n − 1)), (2–1)
it is possible to model the interaction between the learning agent and the environment
as a Markov decision process (MDP). For the sake of simplicity, we assume the states
and actions are discrete, but they can also be continuous. A Markov decision process
(MDP) consists of the following elements:
• x(n) ∈ X : states
• a(n) ∈ A : actions
• R^a_{xx'} : (X × A) × X → R : reward function over states x' ∈ X given a state-action pair (x, a) ∈ X × A,

  R^a_{xx'} = E[r(n+1) | x(n) = x, a(n) = a, x(n+1) = x'].  (2–2)

• P^a_{xx'} : state transition probability that gives a probability distribution over states X given a state-action pair in X × A,

  P^a_{xx'} = P(x(n+1) = x' | x(n) = x, a(n) = a).  (2–3)
At time step n, the agent receives the representation of the environment’s state
x(n) ∈ X as input, and according to this input the agent selects an action a(n) ∈ A. By
performing the selected action a(n), the agent receives a reward r(n + 1) ∈ R, and the
state of the environment changes from x(n) to x(n + 1). The new state x(n + 1) follows
the state transition probability Paxx ′ given the action a(n) and the current state x(n). At
the new state x(n + 1), the process repeats; the agent takes an action a(n + 1), and
this will result in a reward r(n + 2) and a state transition from x(n + 1) to x(n + 2). This
process continues either indefinitely or until a terminal state is reached depending on the
process.
There are two important concepts associated with the agent: the policy and value
functions.
• Policy π : X → A is a function that maps a state x(n) to an action a(n).
• The value function is a measure of long-term performance of an agent following a policy π starting from a state x(n),

  State value function:  V^π(x(n)) = E_π[R(n) | x(n)]  (2–4)
  Action value function: Q^π(x(n), a(n)) = E_π[R(n) | x(n), a(n)]  (2–5)
where R(n) is a return.
A common choice for the return is the infinite-horizon discounted model
R(n) = ∑_{k=0}^{∞} γ^k r(n+k+1),   0 < γ < 1  (2–6)
that takes into account the rewards in the long run, but weights them with a discount
factor that prevents the function from growing unbounded as k → ∞ and also provides
mathematical tractability [52].
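As a small illustration, the snippet below computes a truncated version of the discounted return in (2–6) for an arbitrary, made-up reward sequence and discount factor; in practice the sum runs to the end of an episode or is handled incrementally.

gamma = 0.9                                   # illustrative discount factor
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]           # r(n+1), r(n+2), ... (made-up values)

# Truncated discounted return R(n) = sum_k gamma^k * r(n+k+1)
R_n = sum(gamma ** k * r for k, r in enumerate(rewards))
print(R_n)   # 0.9^2 * 1 + 0.9^4 * 2 = 2.1222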
The objective of RL is to find a good policy that maximizes the expected reward of
all future actions given the current knowledge. Since the value function represents the
expected cumulative reward given a policy, the optimal policy π∗ can be obtained based
on the value function; a policy π is better than another policy π′ when the policy π gives
greater expected return than the policy π'. In other words, π ≥ π' when V^π(x) ≥ V^{π'}(x) or Q^π(x, a) ≥ Q^{π'}(x, a) for all x ∈ X and a ∈ A. Therefore, the optimal state value function V^{π*} is defined by V^{π*}(x(n)) = max_π V^π(x(n)), and the optimal action value function Q^{π*} can be obtained by Q^{π*}(x(n), a(n)) = max_π Q^π(x(n), a(n)).
When we have complete knowledge of R^a_{xx'} and P^a_{xx'}, an optimal policy π* can be
directly computed using the definition of the value function. For a given policy π, the
state value function V^π can be expressed as

V^π(x(n)) = E_π[R(n) | x(n)]  (2–7)
          = E_π[ ∑_{k=0}^{∞} γ^k r(n+k+1) | x(n) ]  (2–8)
          = E_π[ r(n+1) + γ ∑_{k=0}^{∞} γ^k r(n+k+2) | x(n) ]  (2–9)
          = ∑_a π(x(n), a(n)) ∑_{x'} P^a_{xx'} [ R^a_{xx'} + γ E_π[ ∑_{k=0}^{∞} γ^k r(n+k+2) | x' ] ]  (2–10)
          = ∑_a π(x(n), a(n)) ∑_{x'} P^a_{xx'} [ R^a_{xx'} + γ V^π(x') ]  (2–11)
The optimal policy π∗ is obtained by selecting an action a(n) satisfying
V^{π*}(x) = max_a ∑_{x'} P^a_{xx'} [ R^a_{xx'} + γ V^{π*}(x') ].  (2–12)
Equation (2–12) is commonly known as the Bellman optimality equation for V^{π*}. For the action value function Q^{π*}, the optimality equation can be obtained in a similar fashion

Q^{π*}(x, a) = ∑_{x'} P^a_{xx'} [ R^a_{xx'} + γ max_{a'} Q^{π*}(x', a') ].  (2–13)
The solution to the equations (2–12) and (2–13) can be obtained using dynamic
programming (DP) methods. However, this procedure is infeasible when the number
of variables increases due to the exponential growth of the state space (curse of
dimensionality) [52]. RL allows us to find policies which approach the Bellman optimal
policies without explicit knowledge of the environment (P^a_{xx'} and R^a_{xx'}); as we will see in the following chapter, in reinforcement learning, temporal difference (TD) algorithms approximate the value functions by learning the parameters using simulations rather than using the explicit state transition probability P^a_{xx'} and reward function R^a_{xx'}. The
estimated value functions will allow comparisons between policies and thus guide the
optimal policy search.
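For reference, the short sketch below solves equation (2–12) by value iteration on a made-up, fully known toy MDP; the transition tensor P, reward tensor R, and discount factor are arbitrary illustrative values. This is exactly the model knowledge that RL methods avoid assuming.

import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)

# P[a, x, x'] = P(x' | x, a); rows normalized to valid distributions
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
# R[a, x, x'] = expected reward for the transition (x, a) -> x'
R = rng.random((n_actions, n_states, n_states))

V = np.zeros(n_states)
for _ in range(1000):
    # Q[a, x] = sum_x' P^a_{xx'} (R^a_{xx'} + gamma * V(x'))
    Q = (P * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=0)             # Bellman optimality backup, eq. (2-12)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

greedy_policy = Q.argmax(axis=0)      # optimal action for each state
print(V, greedy_policy)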
In this chapter, we reviewed the learning paradigm and basic components of
reinforcement learning. The interaction between agent and environment is an important
feature in RL, and policy and value functions are key concepts in the agent providing the
control; based on the value function, a proper policy can be obtained.
CHAPTER 3
STATE VALUE FUNCTION ESTIMATION / POLICY EVALUATION
Value function estimation is an important sub-problem in finding an optimal policy in
reinforcement learning. In this chapter, we will introduce three new temporal difference
algorithms: kernel temporal difference (KTD)(λ), correntropy temporal difference (CTD),
and correntropy kernel temporal difference (CKTD). The algorithms extend the conventional
temporal difference (TD) algorithm, TD(λ), which is a representative online learning
algorithm for estimating the value function.
All of the algorithms listed above use temporal difference (TD) error to update the
system, and given a fixed policy π, the optimal value function can be estimated based
on the TD error. Figure 3-1 shows how the value function can be estimated using an
adaptive system based on the TD error. In an adaptive system, there are two important
elements: the learning algorithm concerned with the class of functions that can be
approximated by the system and the cost function which quantifies the fitness of the
function approximations.
Figure 3-1. Diagram of adaptive value function estimation in reinforcement learning. Given a fixed policy, the value function can be estimated based on the temporal difference error.
We propose using a kernel framework for the mapper. The implicit linear mapping
in a kernel space can provide universal approximation in the input space, and many
of the related optimization problems can be posed as convex (no local minima) with
algorithms that are still reasonably easy to compute (using the kernel trick [44]). In
addition, we apply correntropy as a cost function to find the optimal solution. Correntropy
is a robust similarity measure between two random variables or signals when heavy
tailed or non-Gaussian distributions are involved [21, 36, 45].
3.1 Temporal Difference(λ)
Temporal difference learning is an incremental learning method specialized for
prediction problems, and it provides an efficient learning procedure that can be applied
to reinforcement learning. In particular, TD learning allows learning directly from new
experience without having a model of the environment. It employs previous estimations
to provide updates to the current predictor.
In [50], the TD(λ) algorithm is derived as the solution to a multi-step prediction
problem. For a multi-step prediction problem, we have a sequence of input-output pairs
(x(1), d(1)), (x(2), d(2)), · · · , (x(m), d(m)), in which the desired output d can only be
observed at time m + 1. Then, a system will produce a sequence of predictions y(1),
y(2), · · · , y(m) based solely on the observed input sequences. In general, the predicted
output is a function of all previous inputs,
y(n) = f (x(1), x(2), · · · , x(n)); (3–1)
here, we assume that y(n) = f (x(n)) for simplicity. The predictor f can be defined
based on a set of parameters w , that is,
y(n) = f (x(n),w). (3–2)
Writing the multi-step prediction problem as a supervised learning problem, the
input-output pairs become (x(1), d), (x(2), d), · · · , (x(m), d), and the update rule at each
time step can be written as,
Δw_n = η (d − y(n)) ∇_w y(n),  (3–3)
where η is the learning rate, and the gradient vector ∇_w y(n) contains the partial
derivatives of y(n) with respect to w . As we mentioned above, the desired value of the
prediction d only becomes available at time m + 1, and thus the parameter vector w can
only be updated after m time steps. The update is given by the following expression
w ← w + ∑_{n=1}^{m} Δw_n.  (3–4)
When the predicted output y(n) is a linear function of x(n), we can write the predictor as
y(n) = w^ᵀ x(n), for which ∇_w y(n) = x(n), and the update rule becomes

Δw_n = η (d − w^ᵀ x(n)) x(n).  (3–5)
The key observation to extend the supervised learning approach to the TD method is
that the difference between desired and predicted output at time n can be written as
d − y(n) = ∑_{k=n}^{m} (y(k+1) − y(k))  (3–6)
where y(m+1) ≜ d. Using this expansion in terms of the differences between sequential
predictions, we can update the system at each time step. The TD update rule is derived
as follows:
w ← w + ∑_{n=1}^{m} Δw_n  (3–7)
  = w + η ∑_{n=1}^{m} (d − y(n)) ∇_w y(n)  (3–8)
  = w + η ∑_{n=1}^{m} ∑_{k=n}^{m} (y(k+1) − y(k)) ∇_w y(n)  (3–9)
  = w + η ∑_{k=1}^{m} ∑_{n=1}^{k} (y(k+1) − y(k)) ∇_w y(n)  (3–10)
  = w + η ∑_{n=1}^{m} (y(n+1) − y(n)) ∑_{k=1}^{n} ∇_w y(k)  (3–11)
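As a quick numerical sanity check of this derivation, the snippet below verifies that the incremental TD form (3–11) accumulates to exactly the same total update as the supervised rule (3–3)–(3–4); the linear predictor and the random data are made-up for the example.

import numpy as np

rng = np.random.default_rng(0)
m, dim, eta = 5, 3, 0.1
X = rng.normal(size=(m, dim))        # inputs x(1), ..., x(m)
d = rng.normal()                     # desired output, revealed at time m + 1
w = rng.normal(size=dim)             # current weights of the linear predictor

y = X @ w                            # predictions y(1), ..., y(m)
y_ext = np.append(y, d)              # define y(m+1) = d

# Supervised form (3-3)-(3-4): accumulate eta * (d - y(n)) * x(n)
dw_supervised = sum(eta * (d - y[n]) * X[n] for n in range(m))

# TD form (3-11): eta * sum_n (y(n+1) - y(n)) * sum_{k<=n} x(k)
cum_grad = np.cumsum(X, axis=0)      # running sums x(1) + ... + x(n)
dw_td = sum(eta * (y_ext[n + 1] - y_ext[n]) * cum_grad[n] for n in range(m))

print(np.allclose(dw_supervised, dw_td))   # True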
In this case, all predictions are used equally. By using exponential weighting on
recency, we can emphasize more recent predictions, and this yields the following update rule, in which the weighted sum of gradients is called the eligibility trace:
Δw_n = η (y(n+1) − y(n)) ∑_{k=1}^{n} λ^{n−k} ∇_w y(k).  (3–12)
The eligibility trace is a common method used in RL to deal with delayed reward; it
allows propagating the rewards backward over the current state without remembering
the trajectory explicitly. Expression (3–12) is known as the TD(λ) update rule [50], and
the difference between predictions of sequential inputs is called TD error
eTD(n) = y(n + 1)− y(n). (3–13)
Note that when λ = 0, the update rule becomes
w ← w + η ∑_{n=1}^{m} (y(n + 1) − y(n)) x(n), (3–14)
and this has the same form as the LMS update except that the error term is replaced by the incremental difference between sequential outputs.
In supervised learning, the predictor can only be updated once the error (difference
between predicted output and desired signal) is available. Therefore, in the multi-step
prediction problem, the system could not be updated until the error became available, which only happens at the future time m + 1. In contrast, the
TD algorithm allows system updates directly from the sequence of states. Therefore,
online learning becomes possible without having the desired signal available at all times.
This allows efficient learning in most real world prediction problems; TD learning has
lower memory and computational demands than supervised learning, and empirical
results show that TD(λ) can provide more accurate predictions [50].
3.1.1 Temporal Difference(λ) in Reinforcement Learning
Now, let us see how the TD(λ) algorithm is employed in RL. When we consider the
prediction y as state value function V π given a fixed policy π, TD(λ) can approximate the
state value function ~V π using a parametrized family of functions of the form
~V (x(n)) = w⊤x(n) (3–15)
with parameter vector w ∈ Rd . For convenience, we use V to denote V π unless we need
to indicate different policies. Note that the objective of TD(λ) is to minimize the mean
square error (MSE) criterion,
min E[(V(x(n)) − ~V(x(n)))²]. (3–16)
Based on (2–9), we can obtain an approximate form of the recursion involved in the
Bellman equation as follows:
~V (x(n)) ≈ r(n + 1) + γ ~V (x(n + 1)). (3–17)
Thus, the TD error at time n (3–13) can be associated with the following expression
eTD(n) = r(n + 1) + γV (x(n + 1))− V (x(n)), (3–18)
and the error term (3–18) combined with (3–12) gives us the following per time-step
update rule;
Δw_n = η (r(n + 1) + γV(x(n + 1)) − V(x(n))) ∑_{k=1}^{n} λ^{n−k} ∇_w V(x(k)). (3–19)
Algorithm 1 shows pseudo code for the implementation of the TD(λ) algorithm for
linear value function approximation. The algorithm assumes the following information to
be given:
• a fixed policy π in MDP
• a discount factor γ ∈ [0, 1]
• a parameter λ ∈ [0, 1]
• a sequence of stepsizes η1, η2, · · · for incremental coefficient updating
Algorithm 1 Pseudo code of the TD(λ) algorithm in reinforcement learning
Set w = 0 (or an arbitrary estimate)
Set n = 1
for n ≥ 1 do
    z(n) = x(n), where x(n) ∈ X is a start state
    while (x(n) ≠ terminal state) do
        Simulate one step of the process producing reward r(n + 1) and next state x(n + 1)
        Δw_n = (r(n + 1) + γwᵀx(n + 1) − wᵀx(n)) z(n)
        w ← w + η_n Δw_n
        z(n + 1) = λz(n) + x(n + 1)
        n = n + 1
    end while
end for
At each state transition, the algorithm computes the one step TD error r(n + 1) + γwᵀx(n + 1) − wᵀx(n), and depending on the eligibilities z(n) = ∑_{k=n0}^{n} λ^{n−k} x(k), a portion of the TD error is propagated back to update the system.
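To make Algorithm 1 concrete, the following Python sketch implements the per-step TD(λ) update with linear value function approximation. It is only an illustration: the environment interface step(x), assumed to return the reward, the next state vector, and a termination flag, is a hypothetical helper and not part of the original algorithm description.

import numpy as np

def td_lambda_episode(step, x_start, w, lam=1.0, gamma=1.0, eta=0.1):
    """Run one episode of TD(lambda) with a linear value function V(x) = w^T x."""
    x = x_start
    z = x_start.copy()                 # eligibility trace z(n), initialized with the start state
    while True:
        r, x_next, done = step(x)      # simulate one state transition
        v_next = 0.0 if done else w @ x_next
        td_error = r + gamma * v_next - w @ x
        w = w + eta * td_error * z     # propagate a portion of the TD error back
        if done:
            return w
        z = lam * z + x_next           # z(n + 1) = lambda * z(n) + x(n + 1)
        x = x_next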
3.1.2 Convergence of Temporal Difference(λ)
We will see that in the cases of λ = 0 and λ = 1, the TD solutions converge asymptotically to the ideal solution under given conditions in an absorbing Markov process. For all other values of λ, the solution also converges, but in general the answer is different from the one given by the least mean squares algorithm.
Remember that the conventional TD algorithm assumes that the function class is
linearly parametrized satisfying y = w⊤x . This assumption will be also considered in the
convergence proof for TD with any 0 ≤ λ ≤ 1.
• λ = 1 case
The TD(1) procedure is equivalent to (3–11), and this gives the same per-sequence
weight changes as the supervised learning method since (3–11) is derived by directly
replacing the error term in supervised learning using (3–6) (Theorem 3.1).
Theorem 3.1. On multistep prediction problems, the linear TD(1) procedure produces the same per-sequence weight changes as the Widrow-Hoff rule [50].
• λ = 0 case
The convergence result for linear TD(0) presented in [50] is proved under the assumption that the dynamic system providing the states corresponds to an absorbing Markov process. In an absorbing Markov process, there is a set of terminal states T, a set of
non-terminal states N , and transition probabilities pij where i ∈ N , j ∈ N ∪ T . The
transition probabilities are set such that a terminal state will be visited in a finite number
of state transitions. Here, we assume that an initial state is selected with probability µi
among non-terminal states.
Given the initial state x(i), an absorbing Markov process generates a state
sequence x(i), x(i + 1), · · · , x(j) where x(j) ∈ T . At the terminal state x(j), the
output d is selected from an arbitrary probability distribution with expected value d̄_j. On this absorbing Markov process, the learning procedure should converge asymptotically to the desired behavior based on experience. Here, the desired behavior is to map each non-terminal state x(i) to the expected outcome d given that the sequence starts from i; thus, the ideal predictions y(i) = f(x(i), w) should be equal to E[d|x(i)], ∀i ∈ N. For a completed sequence, we
have the following relation
E[d|x(i)] = ∑_{j∈T} p_{ij} d̄_j + ∑_{j∈N} p_{ij} ∑_{k∈T} p_{jk} d̄_k + ∑_{j∈N} p_{ij} ∑_{k∈N} p_{jk} ∑_{l∈T} p_{kl} d̄_l + · · · (3–20)
= [∑_{k=0}^{∞} Q^k h]_i (3–21)
= [(I − Q)^{−1} h]_i, (3–22)
where [Q]_{ij} = p_{ij} for i, j ∈ N, and [h]_i = ∑_{j∈T} p_{ij} d̄_j for i ∈ N.
The following theorem shows that TD(0) converges to the ideal predictions for the
appropriate step size when the states {x(i)|i ∈ N} are linearly independent.
Theorem 3.2. For any absorbing Markov chain, for any distribution of starting probabilities µ_i, for any outcome distributions with finite expected values d̄_j, and for any linearly independent set of observation vectors {x(i)|i ∈ N}, there exists an ϵ > 0 such that, for all positive η < ϵ and for any initial weight vector, the predictions of linear TD(0) converge in expected value to the ideal predictions. That is, if w_n denotes the weight vector after n sequences have been experienced, then lim_{n→∞} E[w_nᵀ x(i)] = E[d|x(i)] = [(I − Q)^{−1} h]_i, ∀i ∈ N [50].
• 0 < λ < 1 case
The work in [10] extended the convergence of TD to general λ. The TD update rule
(3–11) can be expressed as
w̄^s_{n+1} = w̄^s_n + η X D [Q^s Xᵀ w̄^s_n − Xᵀ w̄^s_n + (Q^{s−1} + Q^{s−2} + · · · + I) h], (3–23)
where w̄ denotes the expected weights, X is a state matrix defined as [X]_{ab} = [x_a]_b, where a runs over the states and b over the dimensions, D is a diagonal matrix satisfying [D]_{ab} = δ_{ab} d_a, where δ_{ab} is the Kronecker delta and d_a is the expected number of times the Markov chain is in state x_a in one sequence, and s denotes the number of state transitions being traced.
After multiplying both sides of (3–23) by Xᵀ and reorganizing the equation using (I + Q + Q² + · · · + Q^{s−1}) h = (I − Q^s) E[d|x(i)] and w̄^λ_n = (1 − λ) ∑_{s=1}^{∞} λ^{s−1} w̄^s_n, we can obtain the following equation
Xᵀ w̄^λ_{n+1} = Xᵀ w̄^λ_n − η Xᵀ X D [I − (1 − λ) Q (I − λQ)^{−1}] (Xᵀ w̄^λ_n − E[d|x(i)]). (3–24)
When the state representations are linearly independent, X has full rank, and the right-hand term of (3–24),
− Xᵀ X D [I − (1 − λ) Q (I − λQ)^{−1}] (Xᵀ w̄^λ_n − E[d|x(i)]), (3–25)
has a full set of nonzero eigenvalues whose real parts are negative. Therefore, if the
above conditions hold, it can be shown that TD(λ) converges with probability 1 using
Theorem 3.3 [24].
Theorem 3.3. Let {y(n)} be given by
y(n + 1) = y(n) + ηn(g(y(n)) + βn + ξn) (3–26)
satisfying the following assumptions
1. g is a continuous R^d-valued function on R^d.
2. {β_n} is a sequence of R^d-valued random variables, bounded with probability 1, such that β_n → 0 with probability 1.
3. {η_n} is a sequence of positive real numbers such that η_n → 0 and ∑_n η_n = ∞.
4. {ξ_n} is a sequence of R^d-valued random variables such that for some T > 0 and each ϵ > 0
lim_{n→∞} P( sup_{j≥n} max_{t≤T} | ∑_{i=m(jT)}^{m(jT+t)−1} η_i ξ_i | ≥ ϵ ) = 0,
where m(t) is defined by max{n : t_n ≤ t} for t ≥ 0 and t_n = ∑_{i=0}^{n−1} η_i.
Also, let {y(n)} be bounded with probability 1. Then, there is a null set N such that ω ∉ N implies that {y^n(·)} is equicontinuous, and also that the limit y(·) of any convergent subsequence of {y^n(·)} is bounded and satisfies the ordinary differential equation (ODE)
y′ = g(y) (3–27)
on the time interval (−∞, ∞). Let y_0 be a locally asymptotically stable (in the sense of Liapunov) solution to (3–27) with domain of attraction DA(y_0). Then, if ω ∉ N and there is a compact set A ⊂ DA(y_0) such that y(n) ∈ A infinitely often, we have y(n) → y_0 as n → ∞ [24].
When we set y(n) in (3–26) as X⊤wn, (3–25) satisfies the ordinary differential
equation (3–27). Therefore, under the assumptions in Theorem 3.3, the differential
equation is asymptotically stable with equilibrium E[d|x(i)]; that is, w_n → w* as n → ∞ with probability 1, where Xᵀw* = E[d|x(i)] and X is full rank.
3.2 Kernel Temporal Difference(λ)
In the previous section, we introduced TD(λ), and we observed how the value
function can be estimated adaptively. Note that TD(λ) approximates the value function with a linear function, which may be limited in practice.
As an alternative, algorithms with nonlinear approximation capabilities have
become a topic of growing interest. Nonlinear variants of TD algorithms have also
been proposed, and they are mostly based on time delay neural networks, sigmoidal
multilayer perceptrons, or radial basis function networks. Despite their good approximation
capabilities, these algorithms are usually prone to fall into local minima [3, 7, 20, 54], so the optimality of the resulting TD(λ) solution is not guaranteed. Kernel methods have become
an appealing choice due to their elegant way of dealing with nonlinear function
approximation problems; the kernel based algorithms have nonlinear approximation
capabilities, yet the cost function can be convex [44].
In the following, we will show how the conventional TD(λ) algorithm can be
extended using kernel functions to obtain nonlinear variants of the algorithm; we
introduce a kernel adaptive filter implemented with stochastic gradient on temporal
differences called kernel temporal difference (KTD)(λ).
3.2.1 Kernel Methods
The basic idea of kernel methods is to nonlinearly map the input data to a high
dimensional feature space of vectors. Let X be a nonempty set. For a positive definite
function κ : X × X → R [28, 44], there exists a Hilbert space H and a mapping
ϕ : X → H such that
κ(x , y) = ⟨ϕ(x),ϕ(y)⟩. (3–28)
The inner product in the high dimensional feature space can be calculated by evaluating
the kernel function in the input space. Here, H is called a reproducing kernel Hilbert
space (RKHS) because it satisfies the following property
f (x) = ⟨f ,ϕ(x)⟩ = ⟨f ,κ(x , ·)⟩,∀f ∈ H. (3–29)
The mapping implied by the use of the kernel function can also be understood
through Mercer’s Theorem (Appendix A) [33]. These properties allow us to transform
conventional linear algorithms in the feature space to non-linear systems without
explicitly computing the inner product in the high dimensional space.
3.2.2 Kernel Temporal Difference(λ)
In supervised learning, a stochastic gradient solution to least squares function
approximation using a kernel method called kernel least mean square (KLMS) is
introduced in [27]. The KLMS algorithm attempts to minimize the risk functional
E[(d − f(x))²] by minimizing the empirical risk J(f) = ∑_{n=1}^{N} (d(n) − f(x(n)))² on the space H induced by the kernel κ. Using (3–29), we can rewrite
J(f) = ∑_{n=1}^{N} [d(n) − ⟨f, ϕ(x(n))⟩]². (3–30)
By differentiating the empirical risk J(f ) with respect to f and approximating the sum by
the current difference (stochastic gradient), we can derive the following update rule
f_0 = 0
f_n = f_{n−1} + η e(n) ϕ(x(n)), (3–31)
where e(n) = d(n)− fn−1(x(n)), which corresponds to KLMS algorithm [27]. Given a new
state x(n), the output can be calculated using the kernel expansion,
f_{n−1}(x(n)) = f_{n−2}(x(n)) + η e(n − 1) κ(x(n − 1), x(n)) (3–32)
= η ∑_{k=1}^{n−1} e(k) κ(x(k), x(n)). (3–33)
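As a reference point for the kernel extension, the KLMS recursion (3–31)–(3–33) can be sketched in a few lines of Python; the Gaussian kernel choice and the parameter values here are assumptions made only for illustration.

import numpy as np

def gauss_kernel(a, b, h=0.5):
    return np.exp(-np.linalg.norm(np.asarray(a) - np.asarray(b)) ** 2 / (2 * h ** 2))

def klms(X, d, eta=0.5, h=0.5):
    """Kernel LMS: the function is stored as centers x(k) with coefficients eta * e(k)."""
    centers, coeffs = [], []
    for x_n, d_n in zip(X, d):
        # evaluate f_{n-1}(x(n)) through the kernel expansion (3-33)
        y_n = sum(c * gauss_kernel(x_k, x_n, h) for x_k, c in zip(centers, coeffs))
        e_n = d_n - y_n
        centers.append(x_n)
        coeffs.append(eta * e_n)
    return centers, coeffs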
As we mentioned above, for a multi-step prediction problem, we can simply say
y(n) = f (x(n)). Let the function f belong to an RKHS H as in KLMS. By treating the
observed input sequence and the desired prediction as a sequence of pairs (x(1), d),
(x(2), d), · · · , (x(m), d) and making d ≜ y(m + 1), we can obtain the updates of the function
f after the whole sequence of m inputs has been observed as
f ← f + ∑_{n=1}^{m} Δf_n (3–34)
= f + η ∑_{n=1}^{m} e(n) ϕ(x(n)) (3–35)
= f + η ∑_{n=1}^{m} [d − f(x(n))] ϕ(x(n)). (3–36)
Here, Δf_n = η[d − ⟨f, ϕ(x(n))⟩] ϕ(x(n)) are the instantaneous updates of the function
f from input data based on the kernel expansions (3–29). By replacing the error d −
f (x(n)) using the relation with temporal differences (3–6) and reorganizing the equation
(3–36) as in the TD(λ) derivation from [50], we can obtain the following update
f ← f + η ∑_{n=1}^{m} [f(x(n + 1)) − f(x(n))] ∑_{k=1}^{n} ϕ(x(k)), (3–37)
and generalizing for λ yields
f ← f + η ∑_{n=1}^{m} [f(x(n + 1)) − f(x(n))] ∑_{k=1}^{n} λ^{n−k} ϕ(x(k)). (3–38)
The temporal differences f (x(n + 1)) − f (x(n)) can be rewritten using the kernel
expansions as ⟨f ,ϕ(x(n + 1))⟩ − ⟨f ,ϕ(x(n))⟩. This yields
f ← f + η ∑_{n=1}^{m} ⟨f, ϕ(x(n + 1)) − ϕ(x(n))⟩ ∑_{k=1}^{n} λ^{n−k} ϕ(x(k)), (3–39)
where Δf_n = η ⟨f, ϕ(x(n + 1)) − ϕ(x(n))⟩ ∑_{k=1}^{n} λ^{n−k} ϕ(x(k)). This update rule (3–39)
is called kernel temporal difference (KTD)(λ) [1, 2]. Using the RKHS properties, the
evaluation of the function f at a certain x can be calculated as a kernel expansion.
When λ = 0, the update rule becomes
f ← f + η ∑_{n=1}^{m} ⟨f, ϕ(x(n + 1)) − ϕ(x(n))⟩ ϕ(x(n)), (3–40)
and it is noticeable that the update rule is exactly of the same form as KLMS (3–36)
except for the error terms; in supervised learning, the error is defined as the difference
between desired signal and predictions at time n, whereas in TD learning, the error is
the difference between sequential predictions.
In addition, equation (3–39) can be modified for state value function approximation
by replacing the error term using (3–18);
f ← f + η ∑_{n=1}^{m} [r(n + 1) + γV(x(n + 1)) − V(x(n))] ∑_{k=1}^{n} λ^{n−k} ϕ(x(k)) (3–41)
= f + η ∑_{n=1}^{m} [r(n + 1) + ⟨f, γϕ(x(n + 1)) − ϕ(x(n))⟩] ∑_{k=1}^{n} λ^{n−k} ϕ(x(k)). (3–42)
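A per-step Python sketch of the KTD(λ) value function update (3–42) follows; the function f is kept as a kernel expansion over the visited states, with an eligibility factor attached to each center. The Gaussian kernel and the step(x) environment interface are assumptions carried over from the earlier TD(λ) sketch.

import numpy as np

def gauss_kernel(a, b, h=0.2):
    return np.exp(-np.linalg.norm(np.asarray(a) - np.asarray(b)) ** 2 / (2 * h ** 2))

def ktd_lambda_episode(step, x_start, centers, alphas, lam=0.6, gamma=1.0, eta=0.3, h=0.2):
    """One episode of KTD(lambda); f(x) = sum_k alphas[k] * kappa(centers[k], x)."""
    def value(s):
        return sum(a * gauss_kernel(c, s, h) for c, a in zip(centers, alphas))

    elig = [0.0] * len(alphas)                 # lambda^{n-k} factor attached to each center
    x = x_start
    while True:
        r, x_next, done = step(x)
        elig = [lam * e for e in elig] + [1.0] # phi(x(n)) joins the expansion
        centers.append(x)
        alphas.append(0.0)
        v_next = 0.0 if done else value(x_next)
        td_error = r + gamma * v_next - value(x)
        # each eligible center receives a portion of the TD error
        alphas[:] = [a + eta * td_error * e for a, e in zip(alphas, elig)]
        if done:
            return centers, alphas
        x = x_next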
3.2.3 Convergence of Kernel Temporal Difference(λ)
Based on the convergence guarantees for TD(λ), we are able to extend the result to
the convergence of KTD(λ).
• λ = 1 case
Theorem 3.1 shows that in the case of TD with λ = 1, the solution converges to the same solution as supervised learning (least squares) due to the derivation of the TD update rule based on (3–6). We can also use this relation to show the convergence of KTD(1).
[27] proved the following proposition;
Proposition 3.1. The KLMS algorithm converges asymptotically in the mean sense to
the optimal solution under the “small-step-size” condition [27].
In a multi-step prediction problem, KTD(1) is derived by replacing the error in
supervised learning with the TD error term using (3–6). Thus, we obtain the following
theorem;
Theorem 3.4. On multi-step prediction problems, the KTD(1) procedure produces the same per-sequence weight changes as the least squares solution.
Proof. Since by (3–6) the sequence of TD errors can be replaced by a multistep
prediction with error e(n) = d − y(n), the result of Proposition 3.1 also applies in this
case.
This means that KTD(1) also asymptotically converges to the optimal solution when the stepsize satisfies ∑_n η_n = ∞ and ∑_n η_n² < ∞ for n ≥ 0.
• λ < 1 case
For general λ cases (λ < 1), we saw that the convergence of TD heavily relies on the state representation x(n); the convergence is proved given that the state feature vectors are linearly independent (Theorems 3.2 and 3.3).
Many models can be reformulated using a dual representation, and this idea
naturally arises when using kernel functions. We derived KTD(λ) using the dual
representation to express the solution of TD(λ) in terms of the kernel function. Note
that the weight vector in the RKHS can be expressed as the linear combination of the
feature vectors ϕ(x) (Proposition 3.2).
Therefore, we can extend Theorem 3.2 and the convergence proof to TD(0 < λ < 1)
to KTD(λ < 1) by showing that the feature map ϕ creates a representation of states in
the RKHS satisfying the linear independence assumption when the kernel κ is strictly
positive definite. This implies that the convergence guarantee of TD(λ < 1) can be
extended to KTD(λ < 1) when it is viewed as a linear function approximator in the
RKHS.
Proposition 3.2. If κ : X × X → R is a strictly positive definite kernel, for any finite set
{xi}Ni=1 ⊆ X of distinct elements, the set {ϕ(xi)} is linearly independent.
Proof. If κ is strictly positive definite, then ∑ α_i α_j κ(x_i, x_j) > 0 for any set {x_i} with x_i ≠ x_j, ∀i ≠ j, and any α_i ∈ R such that not all α_i = 0. Suppose there exists a set {x_i} for which {ϕ(x_i)} are not linearly independent. Then, there must be a set of coefficients α_i ∈ R not
all equal to zero such that ∑ α_i ϕ(x_i) = 0, which implies that ∥∑ α_i ϕ(x_i)∥² = 0:
0 = ∑ α_i α_j ⟨ϕ(x_i), ϕ(x_j)⟩ (3–43)
= ∑ α_i α_j κ(x_i, x_j), (3–44)
which contradicts the assumption.
This shows that if a strictly positive definite kernel is used, the condition of linearly
independent state representations is satisfied in KTD(λ). This is a necessary condition
for convergence of TD(0) in Theorem 3.2 and TD(0 < λ < 1) based on the ODE
representation from Theorem 3.3.
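Proposition 3.2 can also be verified numerically: for a strictly positive definite kernel such as the Gaussian, the Gram matrix over a set of distinct points has full rank, which is exactly the linear independence required by Theorems 3.2 and 3.3. The points in the sketch below are arbitrary examples chosen only for this check.

import numpy as np

def gaussian_gram(X, h=0.2):
    """Gram matrix K with K[i, j] = kappa(x_i, x_j) for the Gaussian kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * h ** 2))

X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.25, 0.25, 0.5, 0.0]])
K = gaussian_gram(X)
print(np.linalg.matrix_rank(K))   # 4, i.e. full rank, so {phi(x_i)} are linearly independent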
3.3 Correntropy Temporal Differences
In the previous sections, we focused our attention on the functional mapper of the adaptive system. In the present section, we turn our attention towards the cost
function. A common issue in practical scenarios is that the assumptions about noise or
the model may not hold or are subject to perturbations. Most studies on TD algorithms
showing the performance on synthetic experiments such as the noiseless Markov
chain or random walk problems do not usually address the issue of how noise or small
perturbations to the model affect performance. In practice, noisy state transitions
or rewards may be observed, and noise may even be present in the input state
representations. Highly noise-corrupted environments lead to difficulties in learning,
and this may result in failure to obtain the desired behavior of the controller.
One of the most widely used figures of merit is the mean square error (MSE), which is a second order statistic, and methods such as TD(λ) and KTD(λ) use this criterion. It is well known that the MSE criterion is most useful under Gaussianity assumptions [20], and any departure from this behavior can affect
performance significantly. Correntropy [42] is an alternative to MSE that has been shown
to be able to deal with situations where the Gaussianity does not hold. One of the main
features of correntropy as a cost function is its robustness to large perturbations in the
learning process; performance improvements over MSE in many realistic scenarios
including fat-tail distributions and severe outlier noise have been demonstrated in
[26, 59].
3.3.1 Correntropy
The generalized correlation function called correntropy was first introduced in [42].
Correntropy is defined in terms of inner products of vectors in a kernel feature space
B(X ,Y ) = E [κ(X − Y )] (3–45)
where X and Y are two random variables, and κ is a translation invariant kernel. When
κ is the Gaussian kernel, the Taylor series expansion of correntropy is given by
B(X, Y) = (1/(√(2π) h_c)) ∑_{n=0}^{∞} ((−1)^n / (2^n h_c^{2n} n!)) E[∥X − Y∥^{2n}]. (3–46)
This expansion shows that correntropy includes all the even-order moments of the random variable ∥X − Y∥. A different kernel can lead to a different expansion, but what is noticeable is that by using a nonlinear kernel, correntropy contains information beyond second order statistics of the statistical distribution, and thus it is better suited for non-linear and non-Gaussian signal processing. It has also been observed that in an
impulsive noise environment, correntropy can obtain performance improvements over
the conventional MSE criterion [26]
MSE(X ,Y ) = E [(X − Y )2]. (3–47)
The geometric meaning of correntropy in the sample space can be explained
through the correntropy induced metric (CIM). The correntropy induced metric (CIM) is
defined as follows
CIM(X ,Y ) = (κ(0)− B(X ,Y ))1/2, (3–48)
where the Gaussian kernel κ(x, y) = exp(−∥x − y∥² / (2h_c²)) is used, and the input space vectors are X = (x_1, x_2, · · · , x_N)⊤ and Y = (y_1, y_2, · · · , y_N)⊤.
Figure 3-2. Contours of CIM(X, 0) in the 2-dimensional sample space of errors e1 = y1 − d1 and e2 = y2 − d2, with kernel size h = 0.2.
Figure 3-2 shows the behavior of the CIM based on a sample {(x1, y1), (x2, y2)} of
size N = 2. Here, kernel size hc = 0.2 is applied. Whereas MSE measures the L2-norm
distance between random variables with finite variance, CIM based on the Gaussian
kernel approximates the L2-norm distance only for points that are close, and as points
get further apart the metric goes to a transition phase where it resembles L1-norm
distance and finally approaches the L0-“norm” for points that are far away.
Notice that if only one of the errors is large, CIM does not change as long as the
other error is small. This behavior shows how CIM can effectively deal with outliers.
Furthermore, the kernel bandwidth hc controls the scale of CIM norm; a smaller kernel
size enlarges the region for the L0-“norm”, and a larger kernel size extends the L2-norm
area. Thus, selecting a proper kernel size is necessary.
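For reference, a sample estimate of correntropy (3–45) and the corresponding CIM value (3–48) can be computed as below; the unnormalized Gaussian kernel with κ(0) = 1 is used, following (3–48), and the error samples are illustrative assumptions.

import numpy as np

def correntropy(x, y, hc=0.2):
    """Sample estimate of B(X, Y) = E[kappa(X - Y)] with a Gaussian kernel."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.mean(np.exp(-e ** 2 / (2 * hc ** 2)))

def cim(x, y, hc=0.2):
    """Correntropy induced metric; kappa(0) = 1 for the (unnormalized) Gaussian kernel."""
    return np.sqrt(1.0 - correntropy(x, y, hc))

# e.g., one small and one large error: the large error saturates and barely moves the CIM
print(cim([0.05, 2.0], [0.0, 0.0]))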
3.3.2 Maximum Correntropy Criterion
Correntropy can be used as a cost function, and it has been applied to adaptive
systems [45, 46, 59]. Let κ be the shift invariant kernel employed in correntropy. The cost
function can be written as
J = E[κ(e)] (3–49)
≈ (1/N) ∑_{n=1}^{N} κ(e(n)). (3–50)
For a system described by parametric mapping y = f (x |θ), the parameter set θ can be
adapted such that the correntropy of error signal d − y is maximized. This is called the
maximum correntropy criterion (MCC)
MCC = max_θ ∑_{n=1}^{N} κ(e(n)). (3–51)
The MCC can be understood in the context of M-estimation, which is a generalized
maximum likelihood method to estimate parameters θ under the cost function
min_θ ∑_{n=1}^{N} ρ(e(n)|θ), (3–52)
where ρ is a differentiable function satisfying ρ(e) ≥ 0, ρ(0) = 0, ρ(e) = ρ(−e), and
ρ(e(i)) ≥ ρ(e(j)) for |e(i)| > |e(j)|. This general estimation is equivalent to a weighted
least square problem
min_θ ∑_{n=1}^{N} w(e(n)) e(n)², (3–53)
where w(e) = ρ′(e)/e and ρ′ is the derivative of ρ. When ρ(e) = (1 − exp(−e²/(2h_c²)))/(√(2π) h_c),
the generalized likelihood problem becomes MCC. The relation to the weighted least
squares problem becomes obvious by looking at the gradient of J, for which a Gaussian
weighting term places more emphasis on small errors, and diminishes the effect of large
errors. This property is key to the robustness to outliers or sudden perturbations in the
error. Notice that the kernel size still controls the weights.
3.3.3 Correntropy Temporal Difference
A variant of the least mean square (LMS) algorithm [45] using MCC has been
formulated in supervised learning. Similar to the MSE criterion, a stochastic gradient
ascent approach can be used to maximize correntropy between desired signal d(n) and
the system output y(n). Let G denote the Gaussian kernel employed by correntropy. The
gradient of the cost function is expressed as follows
∇J_n = ∂B(d(n), y(n))/∂w = (∂G(e(n))/∂e(n)) · (∂e(n)/∂w) (3–54)
= (1/h_c²) e(n) G(e(n)) · ∇_w y(n). (3–55)
In addition, in the previously described multi-step prediction problem, temporal
difference (TD) error can be linked to the LMS algorithm by using the recursion (3–6).
Therefore, we can also apply TD with MCC as follows;
w ← w + η ∑_{n=1}^{m} Δw_n (3–56)
= w + η ∑_{n=1}^{m} [e(n) G(e(n)) ∇_w y(n)] (3–57)
= w + η ∑_{n=1}^{m} [∑_{k=n}^{m} e(k) exp(−(∑_{k=n}^{m} e(k))² / (2h_c²)) x(n)], (3–58)
where e(n) = y(n + 1)− y(n) when y(n) is a linear function of x(n).
In the case of λ = 0, we saw that supervised learning algorithms and their extended
TD algorithms have exactly the same form of update rule except for the error terms.
Thus, we can obtain a direct extension for correntropy temporal difference (CTD) as
follows
w ← w + η ∑_{n=1}^{m} [(y(n + 1) − y(n)) exp(−(y(n + 1) − y(n))² / (2h_c²)) x(n)]. (3–59)
Equation (3–59) also satisfies the weight updates in the case of single step prediction
problems (when m = 1).
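A per-step Python sketch of the CTD(0) update (3–59) is given below. As in the value-estimation experiments of the next chapter, the reward-based TD error (3–18) is used in place of the pure prediction difference; this substitution and the parameter values are assumptions of the sketch.

import numpy as np

def ctd_step(w, x, x_next, r, eta=0.3, gamma=1.0, hc=5.0, terminal=False):
    """One correntropy TD(0) update for a linear value function V(x) = w^T x."""
    v_next = 0.0 if terminal else w @ x_next
    e = r + gamma * v_next - w @ x                # TD error
    weight = np.exp(-e ** 2 / (2 * hc ** 2))      # MCC weighting: large errors are attenuated
    return w + eta * weight * e * x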
3.3.4 Correntropy Kernel Temporal Difference
Using the ideas of both kernel least mean square (KLMS) and maximum correntropy
criterion, kernel maximum correntropy (KMC) is introduced in [59]. Again, to maximize
the error signal correntropy, we can use stochastic gradient ascent, and the updates
to the system are based on the positive gradient of the new cost function in the feature
space. Thus, in KMC, the gradient can be expressed as follows;
∇J_n = ∂B(d(n), y(n))/∂f = (∂G(e(n))/∂e(n)) · (∂e(n)/∂f) (3–60)
= (1/h_c²) e(n) G(e(n)) · ϕ(x(n)). (3–61)
Also, the estimated function at time n + 1 can be obtained as
f_0 = 0 (3–62)
f_{n+1} = f_n + η ∇J_n (3–63)
= f_n + η [exp(−e(n)² / (2h_c²)) e(n) ϕ(x(n))] (3–64)
= f_{n−1} + η ∑_{i=n−1}^{n} [exp(−e(i)² / (2h_c²)) e(i) ϕ(x(i))] (3–65)
= η ∑_{i=1}^{n} [exp(−e(i)² / (2h_c²)) e(i) ϕ(x(i))] (3–66)
Again, by using the error relation in supervised and TD learning in (3–6), in the multistep prediction problem, the temporal difference (TD) error can be integrated in KMC as follows
f ← f + η ∑_{n=1}^{m} [exp(−(∑_{k=n}^{m} (y(k + 1) − y(k)))² / (2h_c²)) ∑_{k=n}^{m} (y(k + 1) − y(k)) ϕ(x(n))]. (3–67)
In the case of λ = 0, we saw that the KLMS (3–36) and KTD(0) (3–40) update rules
have exactly the same form except for the error terms. Thus, we can derive correntropy
kernel temporal difference (CKTD) as follows
f ← f + η ∑_{n=1}^{m} [exp(−(y(n + 1) − y(n))² / (2h_c²)) (y(n + 1) − y(n)) ϕ(x(n))]. (3–68)
This equation also satisfies (3–67) in the case of single step predictions (m = 1).
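Similarly, CKTD(0) (3–68) only rescales the coefficient assigned to the newly inserted center by the exponential of the TD error; a per-step sketch, under the same assumptions as the CTD and KTD examples above, is:

import numpy as np

def gauss_kernel(a, b, h=0.2):
    return np.exp(-np.linalg.norm(np.asarray(a) - np.asarray(b)) ** 2 / (2 * h ** 2))

def cktd_step(centers, alphas, x, x_next, r, eta=0.3, gamma=1.0, h=0.2, hc=5.0, terminal=False):
    """One CKTD(0) update; f is a kernel expansion over previously visited states."""
    def value(s):
        return sum(a * gauss_kernel(c, s, h) for c, a in zip(centers, alphas))
    v_next = 0.0 if terminal else value(x_next)
    e = r + gamma * v_next - value(x)             # TD error
    weight = np.exp(-e ** 2 / (2 * hc ** 2))      # correntropy weighting
    centers.append(x)
    alphas.append(eta * weight * e)               # with lambda = 0 only phi(x(n)) is updated
    return centers, alphas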
Note that compared to TD(0) (3–14) and KTD(0) (3–40), the only difference
between the CTD (3–59) and CKTD (3–68) update rules is the extra weighting term
which is the exponential of the error. Therefore, the stability result from Theorem 3.3
should also apply in the case of correntropy since the extra weighting term can be
factor together with the stepsize. This should not change the conditions on the stepsize
sequence, ηn → 0,∑
n ηn = ∞, since we employ the Gaussian kernel for correntropy
satisfying 0 ≤ G(e) ≤ 1. Nevertheless, the convergence points for correntropy TD and
TD will be different in general.
In this chapter, three new temporal difference algorithms were introduced for
state value function estimation. First, an algorithm that combines kernel based
representations with conventional TD learning, kernel temporal difference (KTD)(λ), was
introduced. One of the key advantages of KTD(λ) is its nonlinear function approximation
capability in the input space with convergence guarantees. Because of the linear
structure of the computations that are implicitly carried out in the feature space through
the kernel expansion, existing results on linear function approximation can be extended
to the kernel setting. Following this, the maximum correntropy criterion (MCC) as a
robust alternative to MSE was applied to TD(λ) and KTD(λ) algorithms. We introduced
the correntropy temporal difference (CTD) and correntropy kernel temporal difference (CKTD) algorithms. These algorithms are shown to be stable and robust under noisy conditions. Note that nonlinear function approximation capabilities and robustness are appealing properties for practical implementations. Learning methods with nonlinear function approximation capabilities have been the subject of active research. However, the lack of convergence guarantees has been an issue that makes this avenue less attractive for real applications. A powerful aspect of KTD(λ) is its approximation mechanism, which overcomes this convergence issue.
CHAPTER 4
SIMULATIONS - POLICY EVALUATION
In this chapter, we examine the empirical performance of the temporal difference algorithms introduced in the previous sections on the problem of estimating the state value function ~V given a fixed policy π.
First of all, we carry out experiments on a simple illustrative Markov chain described
in [6]; we refer to this problem as the Boyan chain problem. This is a popular experiment
involving an episodic task to test TD learning algorithms. The experiment is useful in
illustrating linear as well as nonlinear functions of the state representations, and shows
how the state value function is estimated using adaptive systems. TD(λ) and KTD(λ)
are compared in the linear and nonlinear function approximation problem. Furthermore,
TD(λ), KTD(λ), CTD, and CKTD are applied to a noisy environment where the policy
does not remain fixed but is randomly perturbed.
4.1 Linear Case
To test the efficacy of the proposed method, we first observe the performance
on a simple Markov chain (Figure 4-1). There are 13 states numbered from 12 to
0. Each trial starts at state 12 and terminates at state 0. Each state is represented
by a 4-dimensional vector, and the rewards are assigned in such a way that the
value function V is a linear function on the states; namely, V ∗ takes the values
[0,−2,−4, · · · ,−22,−24] at states [0, 1, 2, · · · , 11, 12]. In the case of V = w⊤x , the
optimal weights are w ∗ = [−24,−16,−8, 0].
To assess the performance of the algorithms, the updated estimate of the state
value function ~V (x) is compared to the optimal value function V ∗ at the end of each trial.
This is done by computing the RMS error of the value function over all states
RMS = √((1/n) ∑_{x∈X} (V*(x) − ~V(x))²), (4–1)
where n is the number of states, n = 13.
Figure 4-1. A 13 state Markov chain [6]. For states from 2 to 12, the state transition probability is 0.5 and the corresponding reward is −3. State 1 has state transition probability of 1 to the terminal state 0 and reward −2. States 12, 8, 4, and 0 have 4-dimensional state space representations [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], and [0, 0, 0, 1] respectively, and the representations of the other states are linear interpolations between the above vectors.
We saw in the previous chapter that the stepsize is required to satisfy η(n) ≥ 0, ∑_{n=1}^{∞} η(n) = ∞, and ∑_{n=1}^{∞} η(n)² < ∞ to guarantee convergence. Consequently, the following stepsize scheduling is applied;
η(n) = η_0 (a_0 + 1)/(a_0 + n), n = 1, 2, · · · , (4–2)
where η_0 is the initial stepsize, and a_0 is the annealing factor which controls how fast the
stepsize decreases. In this experiment, a0 = 100 is applied. Furthermore, we assume
that the policy π is guaranteed to terminate, which means that the value function V π is
well-behaved without using a discount factor γ in (2–6); that is, γ = 1.
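To make the setup concrete, the sketch below builds the 4-dimensional state representations of Figure 4-1 by linear interpolation between the anchor states 12, 8, 4, and 0, and implements the stepsize schedule (4–2). The interpolation rule is our reading of the figure caption and should be treated as an assumption; it is at least consistent with the optimal weights w* = [−24, −16, −8, 0] and V* = −2·(state) given above.

import numpy as np

def boyan_features(state):
    """4-D representation of a Boyan chain state (0..12), interpolated between anchors."""
    anchors = [12, 8, 4, 0]
    return np.array([max(0.0, 1.0 - abs(state - a) / 4.0) for a in anchors])

def stepsize(n, eta0=0.1, a0=100):
    """Annealed stepsize schedule (4-2)."""
    return eta0 * (a0 + 1) / (a0 + n)

w_star = np.array([-24.0, -16.0, -8.0, 0.0])
print(boyan_features(10), w_star @ boyan_features(10))   # [0.5 0.5 0. 0.] and -20.0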
Using the above set up, we first apply TD(λ) to estimate the value function
corresponding to the Boyan chain (Figure 4-1). To obtain the optimal parameters,
various combinations of eligibility trace rates λ and initial step size η0 values are
evaluated. Eligibility trace rates λ from 0 to 1 with 0.2 jumps and initial stepsizes η0
between 0.1 and 0.9 with 0.1 intervals are observed for 1000 trials (Figure 4-2). The
RMS error of the value function is averaged over 10 Monte Carlo runs, and the initial
weight vector is set as w = 0 at each run.
Across all values of λ with optimal stepsize, TD(λ) provides good approximation to
V ∗ after 1000 trials. We observe that small stepsizes generally give better performance.
Figure 4-2. Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η0 in TD(λ). The plotted vertical line segments are the mean RMS values after 100 trials (top markers), 500 trials (middle markers), and 1000 trials (bottom markers).
However, if the stepsize is very small, the system fails to reach a good performance
level, especially with small λ values (λ = 0, 0.2, and 0.4). We know that stepsize mainly
controls the trade off between performance accuracy and speed of learning, so small
stepsize learning may be too slow to converge within 1000 trials. Also, large stepsizes
result in larger error due to mis-adjustment. Based on Figure 4-2, parameter values of
λ = 1 and η0 = 0.1 are selected for further observation.
Before we extend the experiment, we want to observe the behavior of KTD(λ) in
a linear function approximation problem. We previously emphasized the capability of KTD(λ) as a nonlinear function approximator; however, with an appropriate kernel size, KTD(λ) should also approximate linear functions well on a region of interest.
In KTD(λ), we employ the Gaussian kernel,
κ(x(i), x(j)) = exp(−∥x(i) − x(j)∥² / (2h²)), (4–3)
which is a universal kernel commonly encountered in practice. To find the optimal kernel
size, we fix all the other free parameters around median values, λ = 0.4 and η0 = 0.5,
and the average RMS error over 10 Monte Carlo runs is compared (Figure 4-3).
For this specific experiment, it seems obvious that smaller kernel sizes yield better
performance, since the state representations are finite. However, in general, applying
Figure 4-3. Performance over different kernel sizes h in KTD(λ). The vertical line segments contain the mean RMS values after 100 trials (top markers), 500 trials (middle markers), and 1000 trials (bottom markers).
too small a kernel size leads to over-fitting or, in this case, to slow learning. In particular, choosing a very small kernel leads to a procedure very similar to a table look up method. Thus, we choose the kernel size h = 0.2, the largest kernel size for which we obtain mean RMS values similar to those for h = 0.1 and h = 0.05 at the 1000th trial, and the lowest mean RMS at the 100th trial.
After fixing the kernel size at h = 0.2, the experimental evaluation of different
combinations of eligibility trace rates λ and initial step sizes η0 are observed. Figure 4-4
shows the average performance over 10 Monte Carlo runs for 1000 trials.
Figure 4-4. Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η0 in KTD(λ) with h = 0.2. The plotted vertical line segments contain the mean RMS value after 100 trials (top marker), 500 trials (middle marker), and 1000 trials (bottom marker).
All λ values with optimal stepsize show good approximation to V* after 1000 trials. Smaller stepsizes with larger λ values show better performance in TD (Figure 4-2), whereas larger stepsizes with smaller λ values perform better in KTD. Notice that KTD(λ = 0) shows slightly better performance than KTD(λ = 1); this may be attributed to the local nature of KTD when using the Gaussian kernel. In addition, varying the stepsize has a relatively small effect on KTD(λ). Again, the Gaussian kernel, as well as other normalized kernels, provides an implicit normalized update rule which is known to be less sensitive to step size. Based on Figure 4-4, the optimal eligibility trace rate and initial stepsize values λ = 0.6 and η0 = 0.3 are selected for KTD with kernel size h = 0.2.
The learning curves of TD(λ) and KTD(λ) are compared. The optimal parameters
are employed in both algorithms based on the experimental evaluation (λ = 1 and
η0 = 0.1 for TD and λ = 0.6 and η0 = 0.3 for KTD), and the RMS error is averaged over
50 Monte Carlo runs for 1000 trials. Comparative learning curves are given in Figure 4-5.
Both algorithms reach a mean RMS value of around 0.06. Here, we confirmed the ability of TD(λ) and KTD(λ) to handle the function approximation problem when the fixed policy yields a state value function that is linear in the state representation. As expected, TD(λ) converges faster to the optimal solution because of the linear nature of the problem. KTD(λ) converges more slowly than TD(λ), but it is also able to approximate the value function properly. In this sense, KTD is open to a wider class of problems than its linear counterpart.
Also, the estimated state values ~V for the last 50 trials are observed in Figure 4-6. It
shows that both TD(λ) and KTD(λ) successfully estimate the optimal state value V ∗.
4.2 Linear Case - Robustness Assessment
In this section, we want to observe the role of the cost function in the adaptation
process. In the following experiment, we consider the same Boyan chain from the
previous linear case (Figure 4-1), but unlike the above case, the rewards are random
Figure 4-5. Learning curves of TD(λ) and KTD(λ). The solid line shows the mean RMS error and the dashed line shows the standard deviations over 50 Monte Carlo runs.
Figure 4-6. The comparison of state value V(x), x ∈ X, convergence between TD(λ) and KTD(λ). The solid line shows the optimal state values V* and the dashed line shows the estimated state values ~V by TD(λ) (left) and KTD(λ) (right).
variables themselves. We refer to them as noisy rewards. Three types of noise will be
added to the original discrete reward values, and the behaviors of TD and correntropy TD will be compared.
First, Gaussian noise with probability density function
G(µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)² / (2σ²)) (4–4)
is added to the rewards. Here, the mean µ is set as zero, and different variance values
(σ2 = 0.2, 0.5, 1, 2, 10, 20, 50) are applied. From Figure 4-2, we observed that the
parameter set λ = 1 and η0 = 0.1 leads to the best performance. However, for fair
comparison with CTD, TD with λ = 0 and η0 = 0.3 will be applied. To observe the
influence of the Gaussian noise in the performance of the TD algorithm, we apply it for
two parameter sets (λ = 1 and η0 = 0.1 (red line) and λ = 0 and η0 = 0.3 (blue line) as
depicted in Figure 4-7). The RMS error is averaged over 50 Monte Carlo runs, and the
plot shows the mean and standard deviation at the 1000th trial.
Figure 4-7. The performance of TD for different levels (variances σ²) of additive Gaussian noise on the rewards, for the parameter sets λ = 1, η0 = 0.1 and λ = 0, η0 = 0.3.
For both parameter sets, increasing the noise variance worsens the performance of TD in terms of the mean and standard deviation of the average RMS errors. Now, we examine how CTD behaves with respect to the noise variance. Recall that correntropy itself requires setting an extra kernel size that is different from the kernel size parameter required in KTD. To distinguish the two kernels, we will refer to the kernel hc as the correntropy kernel. Different correntropy kernel sizes (hc = 1, 2, 3, 4, 5, 10, 20, 50, 100) are applied to the CTD algorithm, and its performance is observed with respect to the noise variances σ² = 0.2, 1, 10, 50 (Figure 4-8).
Smaller correntropy kernel sizes hc yield higher RMS error, but as the correntropy kernel size gets larger, the error converges to values similar to those obtained with TD for λ = 0 and η0 = 0.3 (the blue line in Figure 4-7). This result is intuitive since MSE is optimal under Gaussian assumptions, and for a large enough kernel size correntropy behaves similarly to MSE. We further look into the learning behavior of TD and CTD when the noise variance σ² = 10 is taken (Figure 4-9). The RMS error is averaged over 50 Monte Carlo runs, and the mean and standard deviation at the 1000th trial are displayed.
As shown in Figures 4-7 and 4-8C, similar means and variances for TD and CTD can be observed. Nevertheless, CTD shows smoother learning curves than TD. This is expected since correntropy behaves similarly to MSE when hc → ∞. From the comparison between the two different correntropy kernel sizes, hc = 5 and hc = 10, we observe that the smaller correntropy kernel size has a slower convergence rate. In this experiment, we confirmed that the MSE criterion is optimal in the case of Gaussian noise with zero mean, and that CTD is also able to approximate the value function with a proper choice of kernel size hc. Note that Gaussian noise with different variances is added to the assigned reward in Figure 4-8.
Secondly, we explore the behavior of TD and CTD under outlier noise conditions;
the mixture of Gaussian distributions, 0.9G(0, 1) + 0.1G(5, 1), is added to the reward
values. From Figure 4-2, we know that for TD(λ = 0), the initial stepsize η0 = 0.3 is
optimal. To find the optimal correntropy kernel size hc in CTD, we evaluate different
correntropy kernel sizes, hc = 1, 2, 2.5, 3, 5, 10, 20, 50, 100 (Figure 4-10).
Small correntropy kernel sizes, hc = 1 and hc = 2, lead to large RMS error. In this case, convergence can be very slow, since only in a small vicinity of the optimal solution does the gradient take values that make the adaptive system respond accordingly. When hc = 2.5, correntropy TD shows the lowest RMS error, and as hc increases, the average RMS increases and converges to results similar to those of TD. Again, it is obvious that a large correntropy kernel size takes into account larger values of the
Figure 4-8. The performance change of CTD over different correntropy kernel sizes hc: A) σ² = 0.2, B) σ² = 1, C) σ² = 10, D) σ² = 50.
Figure 4-9. Learning curves of TD and CTD (hc = 5 and hc = 10) when Gaussian noise with variance σ² = 10 is added to the reward. The RMS error is averaged over 50 Monte Carlo runs; the solid line shows the mean RMS error, and the dashed line represents the standard deviations.
error, especially those of the second component of the mixture. The learning curves of
TD and CTD are compared in Figure 4-11.
Figure 4-10. Performance of CTD corresponding to different correntropy kernel sizes hc, with a mixture of Gaussians noise distribution. The RMS error is averaged over 50 Monte Carlo runs, and the plot shows the mean and standard deviation at the 1000th trial.
Figure 4-11. Learning curves of TD and CTD (hc = 2.5, 5, and 100) when the noise added to the rewards corresponds to a mixture of Gaussians. The RMS error is averaged over 50 Monte Carlo runs; the solid line is the mean RMS error, and the dashed line shows the standard deviation.
We can observe that as the correntropy kernel size increases, the performance becomes similar to that of TD. Even though CTD with the optimal correntropy kernel size hc initially converges more slowly than TD, the error keeps decreasing below the values obtained with TD. This is a clear example of the robustness of correntropy to non-Gaussian, non-symmetric impulsive noise.
It is well known that "heavy tail" distributions such as the Laplacian make MSE non-optimal [45, 47]. Thus, our third experiment considers Laplacian distributed additive noise
L(µ, 2b²) = (1/(2b)) exp(−|x − µ| / b) (4–5)
in the assigned reward. The mean µ is set as zero, and different variances (b2 =
0.04, 0.25, 1, 4, 25, 100) are applied. Again, TD with λ = 1 and η0 = 0.1 and λ = 0 and
η0 = 0.3 is applied to observe the influence of the Laplacian noise (Figure 4-12). In both
cases, the performance degrades as the variance increases. Moreover, the RMS values
obtained for Gaussian distributed noise with similar variances are smaller, which goes
along with the fact that MSE is suboptimal in this case.
Figure 4-12. Performance changes of TD with respect to different Laplacian noise variances b², for the parameter sets λ = 1, η0 = 0.1 and λ = 0, η0 = 0.3. The RMS error is averaged over 50 Monte Carlo runs, and the plot shows the mean and standard deviation at the 1000th trial.
The performance of CTD is observed for different noise variances b2 = 0.04, 1, 25, 100.
The correntropy kernel sizes of hc = 1, 2, 3, 4, 5, 10, 20, 50, 100 are applied. Figure 4-13
shows the corresponding RMS values.
When the noise variance is small (b2 = 0.04 and b2 = 1), the results show
that performance does not degrade as the correntropy kernel size becomes larger
(Figure 4-13A and 4-13B). In this case, the two cost functions, MSE and MCC, do not show significant differences. However, when the noise variance is large (b² = 25
Figure 4-13. Performance of CTD depending on different correntropy kernel sizes hc with various Laplacian noise variances: A) b² = 0.04, B) b² = 1, C) b² = 25, D) b² = 100. The RMS error is averaged over 50 Monte Carlo runs, and the plot shows the mean and standard deviation at the 1000th trial.
and b2 = 100), certain correntropy kernel sizes show smaller error than other larger
correntropy kernel sizes (Figure 4-13C and 4-13D). Since MSE is not optimal under
’heavy tail’ noise distributions, approximating the behavior of correntropy to MSE
by increasing the kernel size results in worse performance. Figure 4-14 shows the learning curves of TD and CTD when Laplacian noise with variance b² = 25 is added to the reward.
At the beginning, TD shows a slightly faster convergence rate, but after around
the 50th trial, CTD reaches lower RMS error than TD. Again, this example verifies the
robustness of correntropy in non-Gaussian scenarios, in particular, heavy-tailed (sparse) noise distributions.
Figure 4-14. Learning curves of TD and CTD (hc = 10) when Laplacian noise with variance b² = 25 is added to the reward. The RMS error is averaged over 50 Monte Carlo runs; the solid line is the mean RMS error, and the dashed line shows the standard deviations.
4.3 Nonlinear Case
We have seen the performances of TD(λ), KTD(λ), and CTD on the problem
of estimating a state value function, which is a linear function of the given state
representation. Now, the same problem can be turned into a nonlinear one by modifying
the reward values in the chain such that the resulting state value function V ∗ is no longer
a linear function of the states.
The number of states and the state representations remain the same as in the
previous section. However, the optimal value function V ∗ becomes nonlinear with
respect to the representation of the states; namely, (V ∗ = [0− 0.2− 0.6− 1.4− 3− 6.2−
12.6− 13.4− 13.5− 14.45− 15.975− 19.2125− 25.5938]) for states 0 to 12. This implies
that the values of rewards for each state are also different from the ones given for the
linear case (Figure 4-15).
Again, to evaluate the performance, after each trial is completed, the estimated
state value ~V is compared to the optimal state value V ∗ using RMS error (4–1) as
described above for the linear case.
Figure 4-15. A 13 state Markov chain. In states from 2 to 12, each state transition has probability 0.5, and state 1 has transition probability 1 to the absorbing state 0. Note that the optimal state value function can be represented as a nonlinear function of the states, and corresponding reward values are assigned to each state.
First of all, TD(λ) is applied to estimate the value function with various combinations
of λ and initial step size η0. λ from 0 to 1 with 0.2 intervals and initial stepsizes η0
between 0.1 and 0.9 with 0.1 intervals are observed for 1000 trials. The RMS error of the
value function is the average over 10 Monte Carlo runs, and the initial weight vector is
set as w = 0 at each run (Figure 4-16).
Figure 4-16. The effect of λ and the initial step size η0 in TD(λ). The plotted line segments contain the mean RMS value after 100 trials (top marker), 500 trials (middle marker), and 1000 trials (bottom marker).
It is noticeable that larger λ values show better performance, as we know this case corresponds to the least mean squares solution. The behavior for intermediate cases (λ < 1) is not guaranteed to converge to the optimal solution since the representations of all states do not form a linearly independent set of vectors. However, the solution for λ = 1 will still try to approximate E[d|x] because of the implicit regularization in the
stochastic gradient algorithm. For further observation, TD with λ = 0.8 and η0 = 0.1 will
be applied.
For KTD(λ), the Gaussian kernel (4–3) is applied, and kernel size h = 0.2 is chosen
based on Figure 4-17; after fixing all the other free parameters around median values
λ = 0.4 and η0 = 0.5, the average RMS error for 10 Monte Carlo runs is compared.
Then, performances with different combinations of parameters (λ and η0) are compared
with h = 0.2. Figure 4-18 shows the average RMS error over 10 Monte Carlo runs for
1000 trials.
Figure 4-17. The performance of KTD with different kernel sizes h. The plotted line segments contain the mean RMS value after 100 trials (top marker), 500 trials (middle marker), and 1000 trials (bottom marker).
Figure 4-18. Performance comparison over different combinations of λ and the initial stepsize η0 in KTD(λ) with h = 0.2. The plotted segments show the mean RMS value after 100 trials (top segment), 500 trials (middle segment), and 1000 trials (bottom segment).
Again, compared to TD, larger stepsizes with smaller λ values perform better in KTD. The combination of λ = 0.4 and η0 = 0.3 shows the best performance, but the λ = 0 case also shows good performance. Unlike TD, we can say that there is no dominant value for λ. Recall that it has been proved that convergence to E[d|x] is guaranteed for linearly independent representations of the states, which is automatically fulfilled in KTD when the kernel is universal. Therefore, the differences are rather due to the convergence speed controlled by the interaction between the step size and the eligibility trace. Based on Figures 4-16 and 4-18, optimal step size and eligibility trace rate values are selected (λ = 0.8 and η0 = 0.1 for TD and λ = 0.4 and η0 = 0.3 for KTD), and their respective average RMS errors over 50 Monte Carlo runs are shown in Figure 4-19.
Figure 4-19. Learning curves of TD(λ) and KTD(λ). The solid line shows the mean RMS error, and the dashed line represents the standard deviation over 50 Monte Carlo runs.
The linear function approximator, TD(λ) (blue line), cannot estimate the optimal
state values, but KTD(λ) outperforms the linear algorithm; this behavior is expected
since the Gaussian kernel is universal. KTD(λ) reaches a mean RMS value of around 0.07, whereas the mean value for TD(λ) is around 1.8. Figure 4-20 shows the optimal state
value V ∗, and the predicted state value ~V by TD(λ) and KTD(λ) for the last 50 trials.
Notice that TD(λ) tries to estimate the value function by a piece-wise evenly spaced
pattern. This is associated with the degrees of freedom of the representation space
(4-dimensional for the present case). In contrast, KTD(λ) is able to faithfully reproduce
the nonlinear behavior of the value function.
Figure 4-20. The comparison of state value convergence between TD(λ) and KTD(λ). The solid line shows the optimal state values V* and the dashed line shows the estimated state values ~V by TD(λ) (left) and KTD(λ) (right).
4.4 Nonlinear Case - Robustness Assessment
In this section, we extend the experiment to observe the performances of KTD(λ)
and CKTD under noisy rewards. We will consider the same Boyan chain from the
previous nonlinear case (Figure 4-15), and a noisy reward or perturbed policy will be
employed.
First of all, we add impulsive noise with a probability density function given by 0.95G(0, 0.05) + 0.05G(0, 5) to the current reward with probability 0.05. This can be thought of as randomly replacing the policy with probability 0.05. Since the state representations
and the optimal state values are the same as with the previous experiment, based on
Figure 4-17 and 4-18, a Gaussian kernel with kernel size h = 0.2 and initial stepsize
η0 = 0.3 with annealing factor a0 = 100 are applied. For fair comparison with correntropy
KTD, λ is set as 0. We validate the optimal correntropy kernel size based on Figure
4-21. We fix all the other free parameters around median values, λ = 0.4 and η0 = 0.5,
and the average RMS errors over 10 Monte Carlo runs are compared.
Figure 4-21. Performances of CKTD depending on the different correntropy kernel sizes. The plotted line segments contain the mean RMS values after 100 trials (top markers), 500 trials (middle markers), and 1000 trials (bottom markers). Note the log scales on the x and y axes.
We can observe that as the correntropy kernel size gets larger, the mean RMS error at the 100th trial decreases, and we know the solution converges to the same solution as KTD. However, after 1000 trials, CKTD with hc = 5 shows the lowest mean RMS error. This implies
that a larger correntropy kernel size brings faster initial convergence speeds, but it fails
to reach lower errors if the correntropy kernel size remains too large. This motivates
the idea that by controlling the correntropy kernel size during the adaptation, we may
obtain fast and robust function approximation. In Figure 4-22, we compare the learning
curves of KTD and CKTD with different correntropy kernel sizes, and Figure 4-23 shows
the estimated state value ~V by KTD and CKTD. In the case of KTD, it is noticeable that
when the undesirable noisy transition occurs the estimation process degrades, and
thus the overall performance is affected. On the other hand, CKTD shows more stable
performance even with the impulsive noise. For CKTD, we apply a fixed correntropy kernel size hc = 5 (blue line); as expected, it shows a slower convergence rate than KTD (red line) but lower error values. To obtain faster convergence, we start with hc = 150, and the correntropy kernel size is switched to hc = 5 at the 100th trial. In this way, we can accelerate the initial convergence rate and, after switching, reach lower error values. A similar
switching scheme has already been utilized in correntropy based adaptive filtering algorithms, but instead of applying a large initial kernel size, the algorithm uses MSE at the initial stage and then switches to correntropy.
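The switching schedule described above can be expressed as a simple rule; the trial threshold and kernel sizes mirror the experiment, while the helper itself is only a sketch.

def correntropy_kernel_size(trial, hc_initial=150.0, hc_final=5.0, switch_trial=100):
    """Large hc early (close to MSE, fast convergence), small hc later (robust to outliers)."""
    return hc_initial if trial < switch_trial else hc_final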
Figure 4-22. Learning curves of KTD and CKTD (hc = 5 and changing hc). The solid line is the mean RMS error over 50 Monte Carlo runs and the dashed line shows the standard deviation.
Figure 4-23. The comparison of the state value function ~V estimated by KTD, CKTD with hc = 5, and CKTD with changing hc.
Now, we want to see how perturbed policy influences the performance of KTD
and CKTD. We consider the same Boyan chain from the previous nonlinear case
(Figure 4-15), but the observations come from a policy that has been perturbed. Note
that in this experiment, the reward does not contain any noise; noise is only added to
the state transitions. At each step, with probability 0.1, a transition is made from
state i to a state j drawn uniformly from j = 0, ... , 12, and the reward value corresponding to the state xi is
assigned. Since the state representations and the optimal state values are the same
as the previous experiment, a Gaussian kernel with kernel size h = 0.2, initial stepsize
η0 = 0.3 with annealing factor a0 = 100, and λ = 0 are applied. KTD and CKTD are
trained for 2000 trials and the RMS error is averaged over 100 Monte Carlo runs. Figure
4-24 shows the mean and standard deviation of the error norm over 100 Monte Carlo runs at the
2000th trial when CKTD is used with correntropy kernel sizes hc between 1 and 10 with
increments of 1 and also for hc = 15, 20, 100. Figure 4-25 shows the learning curves
of KTD and CKTD with correntropy kernel sizes hc = 2, 3, 5, 6, 7, 100 in terms of mean
RMS error over 100 Monte Carlo runs.
Again, we observe that a larger correntropy kernel size brings faster initial
convergence speeds, but it fails to reach lower errors if the correntropy kernel size
remains too large, showing similar performance to KTD. This experiment provides
evidence that by controlling the kernel size, we can obtain fast and robust function
approximation. Figures 4-24 and 4-25 show that the correntropy kernel size hc = 100
performs the same as KTD; in the case of KTD, the mean and standard deviation at
the last trial are 1.7016 and 0.4069, respectively. Although hc = 5 and hc = 6 yield
error values in the same interval at the last trial, since a larger kernel size converges faster,
hc = 6 is selected as the optimal correntropy kernel size. In the case of KTD, we observe
that when the undesirable noisy transition occurs the estimation process degrades, and
thus the overall performance is affected. On the other hand, CKTD shows more stable
performance even under the random state transitions.
In this chapter, we examine the behavior of the algorithms introduced in the
previous chapter. We present experimental results on synthetic examples to approximate
the state value function under a fixed policy. In particular, we apply the algorithms
to absorbing Markov chains. We observe that KTD(λ) performs well on both linear
and nonlinear function approximation problems. In addition, we show how the linear
Figure 4-24. Mean and standard deviation of RMS error over 100 runs at the 2000th trial. Note the log scale on the x-axis.
Figure 4-25. Mean RMS error over 100 runs. Notice this is a log plot on the horizontal axis.
independence of the input state representations can affect the performance of the
algorithms; this linear independence is an essential guarantee for the convergence of TD with eligibility
traces. The use of strictly positive definite kernels in KTD(λ) implies the linear independence
condition, and thus this algorithm converges for all λ ∈ [0, 1]. Moreover, we perform
experiments with the maximum correntropy criterion under noisy conditions. Experiments
with heavy tail distributions on noisy rewards and state transition probabilities show that
the CTD and CKTD algorithms can improve performance over the conventional MSE criterion. In
particular, the robust behavior of correntropy is tested for Laplacian noise and impulsive
noise that represent the effects of outliers in the reward. Correntropy was also
tested when the policy is randomly replaced; this was achieved by adding a random
perturbation to the state transitions. In the following chapters, we extend the TD
algorithms to estimate the action value function, which can be applied in finding a
proper control policy.
CHAPTER 5
POLICY IMPROVEMENT
We have shown how the kernel based nonlinear mapping and TD(λ) can be
combined in kernel based least mean squares temporal difference learning with
eligibility traces, called KTD(λ), and we have seen the advantages of KTD(λ) in nonlinear
function approximation problems. Moreover, a new robust cost function based on
correntropy has been integrated into the TD and KTD algorithms.
So far, we have only used TD learning algorithms to estimate the state value
function given a fixed policy. However, this is still an intermediate step in RL. Recall that
we want to find a proper state to action mapping that results in maximum return. Since
the value function quantifies the relative desirability of different states in the state space,
it allows comparisons between policies and thus guides the optimal policy search.
Therefore, we can extend the proposed methods to solve complete RL problems.
Here, our goal is to find the optimal control action A(n) at each time n which
maximizes the cumulative reward. When the optimal state value function V ∗ is obtained
(2–12), an optimal policy π∗ can be derived using the state value function; the optimal
action sequence {A(n)} is given by
A(n) = \arg\max_{a \in \mathcal{A}(x)} \sum_{x'} P^{a}_{xx'} \left[ R^{a}_{xx'} + \gamma V^{*}(x') \right]. (5–1)
Here, for the sake of simplicity, we denote x(n) by x. However, direct use of (5–1) is
still limited because, in practice, P^{a}_{xx'} or R^{a}_{xx'} are unknown most of the time. One way to
get around these issues is with Q-learning [55]. Q-learning allows the estimation of the
optimal value function Q∗(x , a) incrementally, and based on the estimated Q, a proper
policy can be obtained.
From the definition of state-action value functions, we have the following relation:
V^{*}(x) = \max_{a \in \mathcal{A}(x)} Q^{\pi^{*}}(x, a). This shows that the optimal action process (5–1) can be
obtained using the action value function Q:

A(n) = \arg\max_{a \in \mathcal{A}(x)} Q^{*}(X(n), a), (5–2)
where {X (n)} is the controlled Markov chain [32].
5.1 State-Action-Reward-State-Action
The first step to apply Q-learning is to estimate the state-action value function Q
instead of the state value function V. State-Action-Reward-State-Action (SARSA) is
introduced to learn the state-action value function Q given a fixed policy. The update
rule of SARSA is as follows,

Q(x(n), a(n)) \leftarrow Q(x(n), a(n)) + \eta \left[ r(n+1) + \gamma Q(x(n+1), a(n+1)) - Q(x(n), a(n)) \right]. (5–3)
This shows that to complete one update, the sequence of state-action pair (x(n), a(n)),
corresponding reward r(n+1), and transition to next state-action pair (x(n+1), a(n+1))
are required, and thus the name SARSA.
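As an illustration, a minimal tabular sketch of the SARSA update (5–3) could look as follows, assuming Q is a dictionary indexed by (state, action) pairs; the names are illustrative:

```python
def sarsa_update(Q, x, a, r_next, x_next, a_next, eta=0.1, gamma=0.9):
    """One SARSA update, Eq. (5-3): Q(x,a) <- Q(x,a) + eta*[r + gamma*Q(x',a') - Q(x,a)]."""
    td_error = r_next + gamma * Q[(x_next, a_next)] - Q[(x, a)]
    Q[(x, a)] += eta * td_error
    return Q
```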
SARSA has a strong relation with Q-learning. It can be understood as Q-learning
[55] given a fixed policy [52]. Q-learning does not use a fixed policy, but it explores
different policies to ultimately obtain a good policy. For large state X and action A
spaces, we can estimate the Q values using function approximators, but now the
proposed TD(λ) algorithms are applied to state action pairs rather than only to states
[48]. This gives the basic idea of how the TD algorithms can be associated with Q
function approximation in policy evaluation.
5.2 Q-learning
Q-learning is a well known off-policy TD control algorithm. The form of the
state-action mapping function (policy) is undetermined, and TD learning is applied
to estimate the state-action value function. This allows the system to explore policies
towards finding an optimal policy. This is an important feature for practical applications
since prior information about a policy is usually not available.
Since value functions represent the expected cumulative reward given a policy,
we can say that the policy π is better than the policy π′ when the policy π gives greater
expected return than the policy π'. In other words, π ≥ π' if and only if Q^{\pi}(x, a) ≥ Q^{\pi'}(x, a)
for all x ∈ X and a ∈ A. Therefore, the optimal action value function Q can be
written as follows,

Q^{*}(x(n), a(n)) = \max_{\pi} Q^{\pi}(x(n), a(n)) (5–4)
= E\left[ r(n+1) + \gamma \max_{a(n+1)} Q^{*}(x(n+1), a(n+1)) \,\middle|\, x(n), a(n) \right] (5–5)
The equation (5–5) can be estimated online, and a one-step Q-learning update can be
defined as,
Q(x(n), a(n)) \leftarrow Q(x(n), a(n)) + \eta \left[ r(n+1) + \gamma \max_{a} Q(x(n+1), a) - Q(x(n), a(n)) \right], (5–6)
to maximize the expected reward E [r(n + 1)| x(n), a(n), x(n + 1)]. At time n, an action
a(n) can be selected using methods such as ϵ-greedy or the Boltzmann distribution,
which are commonly applied [53].
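A minimal sketch of the one-step Q-learning update (5–6) together with ϵ-greedy action selection, assuming a tabular Q (for example a collections.defaultdict(float)) and illustrative names, could look as follows:

```python
import random

def epsilon_greedy(Q, x, actions, eps=0.1):
    """epsilon-greedy selection: a random action with probability eps,
    otherwise the action with the highest current Q value."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(x, a)])

def q_learning_update(Q, x, a, r_next, x_next, actions, eta=0.1, gamma=0.9):
    """One-step Q-learning update, Eq. (5-6)."""
    td_error = r_next + gamma * max(Q[(x_next, b)] for b in actions) - Q[(x, a)]
    Q[(x, a)] += eta * td_error
    return Q
```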
In the case that the state X and action A sets are finite, (5–6) allows explicitly
computing the action value function Q. However, when the state set X and action set A are
infinite or very large, it is infeasible to obtain explicit Q values. Thus, we will see how
function approximation can be integrated into Q-learning.
5.3 Q-learning via Kernel Temporal Differences and Correntropy Variants
We have seen how temporal difference algorithms approximate the state value
functions using a parametrized family of functions. In Q-learning, the state-action
value function Q can be approximated using the proposed methods (KTD(λ), CTD, and
CKTD).
This can be done using the same methods employed for the state value function
estimation. We previously approximated the state value function using a parametrized
family of functions such as ~V (x(n)) = f (x(n),w) using TD algorithms. We can apply the
same approach to approximate the state-action value function:
~Q(x , a = i) = f (x ,w |a = i). (5–7)
In the case of the linear function approximators (TD(λ) and CTD), the action value
function can be estimated as ~Q(x(n), a = i) = w⊤x(n), and for their kernel extensions
(KTD(λ) and CKTD), the action value function can be approximated as ~Q(x(n), a = i) =
⟨f, ϕ(x(n))⟩. Note that ~Q(x(n), a = i) denotes a state-action value given a state x(n) at
time n and a discrete action i.
Therefore, based on Q-learning (5–6), the update rule for KTD(λ) (3–39) can be
integrated as
f \leftarrow f + \eta \sum_{n=1}^{m} \left[ r(n+1) + \gamma \max_{a} Q(x(n+1), a) - Q(x(n), a(n)) \right] \sum_{k=1}^{n} \lambda^{n-k} \phi(x(k)). (5–8)
We call this approach Q-learning via kernel temporal difference (Q-KTD)(λ). For
single-step prediction problems (m = 1), (5–8) yields single updates for Q-KTD(λ) of the
form
Q_{i}(x(n)) = \eta \sum_{j=1}^{n-1} e_{TD_{i}}(j)\, I_{k}(j)\, \kappa\langle x(n), x(j) \rangle. (5–9)
Here, Q_{i}(x(n)) = Q(x(n), a = i), and e_{TD_{i}}(n) denotes the TD error defined as

e_{TD_{i}}(n) = r_{i} + \gamma Q_{i}(x(n+1)) - Q_{i}(x(n)), (5–10)
and Ik(n) is an indicator vector with the same size as the number of outputs (actions).
This means that only the k th entry of the vector is set to 1 and the rest of the entries
are 0. The selection of the action unit k at time n can be based on a greedy method.
Therefore, only the weight (parameter vector) corresponding to the winning action gets
updated. Recall that the reward ri corresponds to the action selected by the current
policy with input x(n) because it is assumed that this action causes the next input state
x(n + 1).
The selection of the action unit k at time n is based on methods such as ϵ-greedy
and the Boltzmann distribution which are commonly applied for the action selection [53].
We adopt ϵ-greedy for our experiments. This is one of the most popular methods to
control the exploration and exploitation trade off. The action corresponding to the unit
with the highest Q value gets selected with probability 1− ϵ. Otherwise, any other action
is selected at random. In other words, the probability of selecting a random action is ϵ.
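To make the structure of the single-step update (5–9) concrete, the sketch below outlines one possible implementation of Q-KTD(0) with a Gaussian kernel and ϵ-greedy action selection. The class and method names are illustrative, and details such as the handling of terminal steps are assumptions rather than part of the original description:

```python
import numpy as np

class QKTD0:
    """Minimal sketch of Q-learning via KTD(0), Eq. (5-9): a Gaussian-kernel expansion
    where only the unit of the selected (winning) action receives a nonzero coefficient."""

    def __init__(self, n_actions, kernel_size=0.2, eta=0.3, gamma=0.9, eps=0.01):
        self.n_actions, self.h, self.eta, self.gamma, self.eps = n_actions, kernel_size, eta, gamma, eps
        self.centers = []   # stored input states x(j)
        self.coeffs = []    # per-center coefficient vectors (zero except for the winning action)

    def _kernel(self, x, c):
        return np.exp(-np.sum((x - c) ** 2) / (2.0 * self.h ** 2))

    def q_values(self, x):
        q = np.zeros(self.n_actions)
        for c, alpha in zip(self.centers, self.coeffs):
            q += alpha * self._kernel(x, c)
        return q

    def select_action(self, x, rng):
        if rng.random() < self.eps:                    # exploration
            return int(rng.integers(self.n_actions))
        return int(np.argmax(self.q_values(x)))        # exploitation

    def update(self, x, a, r_next, x_next, terminal):
        """Add one unit centered at x; only the entry of the selected action is nonzero."""
        q_next = 0.0 if terminal else np.max(self.q_values(x_next))
        td_error = r_next + self.gamma * q_next - self.q_values(x)[a]
        alpha = np.zeros(self.n_actions)
        alpha[a] = self.eta * td_error                 # indicator vector I_k(n) times eta * e_TD
        self.centers.append(np.asarray(x, dtype=float))
        self.coeffs.append(alpha)
```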
The structure of Q-learning based on KTD(0) is shown in Figure 5-1. The number of units (kernel evaluations) increases as more training data arrives. Each added unit is centered at one of the previous input locations x(1), x(2), · · · , x(n − 1).
Figure 5-1. The structure of Q-learning via kernel temporal difference (λ)
Likewise, Q-learning via correntropy temporal difference (Q-CTD) has the following
update rule

w \leftarrow w + \eta \sum_{n=1}^{m} \exp\left( \frac{-e_{TD}(n)^{2}}{2h_{c}^{2}} \right) e_{TD}(n)\, x(n), (5–11)

and Q-CKTD is updated as

f \leftarrow f + \eta \sum_{n=1}^{m} \exp\left( \frac{-e_{TD}(n)^{2}}{2h_{c}^{2}} \right) e_{TD}(n)\, \phi(x(n)). (5–12)
Here, the temporal difference error eTD is defined as
e_{TD}(n) = r(n+1) + \gamma \max_{a} Q(x(n+1), a) - Q(x(n), a(n)). (5–13)
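Relative to the Q-KTD sketch above, the only change introduced by the correntropy cost is that the coefficient assigned to each new unit is weighted by the Gaussian factor in (5–11) and (5–12). A minimal illustrative helper, assuming the same growing-dictionary structure, could be:

```python
import numpy as np

def cktd_coefficient(td_error, eta, hc):
    """Correntropy-weighted coefficient for Q-CKTD, Eq. (5-12): large TD errors
    (outliers) are attenuated by the Gaussian factor exp(-e^2 / (2*hc^2))."""
    return eta * np.exp(-td_error ** 2 / (2.0 * hc ** 2)) * td_error
```

In the earlier sketch, the line `alpha[a] = self.eta * td_error` would simply be replaced by `alpha[a] = cktd_coefficient(td_error, self.eta, hc)`.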
5.4 Reinforcement Learning Brain Machine Interface Based on Q-learning with Function Approximation
We have seen how the agent and environment interact in the reinforcement learning
paradigm in Figure 2-1. Moreover, in Figure 1-1, we have shown how the environment
can be conceived of within the reinforcement learning brain machine interface (RLBMI)
paradigm. The TD algorithms we proposed help model the agent. Figure 3-1 shows how
state value function V can be estimated using the proposed TD algorithms under a fixed
policy. Note that the state value function approximation is only an intermediate step and
the form of policy is fixed.
In RLBMI, it is essential to find the policy which conveys the desired action on the
external device. Direct computation of the optimal policy is challenging since not all the
information required to calculate the optimal policy is known in practice. Therefore,
we estimate the optimal policy using the action value function Q. Figure 5-2 depicts the
RLBMI structure using Q-learning with the proposed TD algorithms.
Figure 5-2. The decoding structure of the reinforcement learning model in a brain machine interface using a Q-learning based function approximation algorithm.
Based on the neural state from the environment, the action value function Q can be
approximated using an adaptive system. We proposed algorithms focusing on both
the functional mapping and the cost function. Kernel based representations have been
integrated to improve the functional mapping capabilities of the system, and correntropy
has been employed as the cost function to obtain robustness in the system. Based on
the estimated Q values, a policy decides a proper action. Note that the policy is the
learning policy which changes over time.
Recall that the main advantage of RLBMI is the co-adaptation between two
intelligent systems: the BMI decoder in the agent, and the BMI user in the environment.
Both systems learn how to earn rewards based on their joint behavior. The BMI decoder
learns a control strategy based on the user's neural state and performs actions in
goal directed tasks that update the state of the external device in the environment. In
addition, the user learns the task based on the state of the external device. Both the BMI
decoder and the user receive feedback after each movement is completed and use this
feedback to adapt. Notice that both systems act symbiotically by sharing the external
device to complete their tasks, and this co-adaptation allows for continuous synergistic
adjustments of the BMI decoder and the user even in changing environments. In
Chapter 7, we will examine how this co-adaptation process works in practice by showing
experiments on real BMIs.
CHAPTER 6
SIMULATIONS - POLICY IMPROVEMENT
In this chapter, we examine the empirical performance of the temporal
difference algorithms extended to the problem of finding a proper state to action mapping based
on the estimated action value function Q. In the following, we not only assess their
performance and behavior but also examine the methods' applicability to practical
situations. Note that in the following simulations, the block diagram of the agent remains
the same as in Figure 5-2; nonetheless, the components in the environment block are
indeed different. For instance, in the following mountain car problem, the states are
position and velocity, and the actions are the left and right accelerations as well as
coast.
6.1 Mountain Car Task
We first carry out experiments on a simple dynamic system which was first
introduced in [34]. This experiment is well known as “Mountain-car task,” a famous
episodic task in control problems. There is a car driving along a mountain track as
depicted in Figure 6-1, and the goal of this task is to reach the top of the right side
hill. The challenge in this task is that there are regions near the center of the hill
where maximum acceleration of the car is not enough to overcome the force imposed
by gravity, and therefore a more sophisticated strategy that allows the car to gain
momentum using the hill must be learned. Thus, if the system simply tries to maximize
short term rewards, it would fail to reach the goal. In this case, the only way to reach the
goal is to first accelerate backwards, even though it is further away from the goal, and
then drive forward with full acceleration. This is a representative example to evaluate the
system’s capability to find a proper policy to achieve a goal in RL.
The details of the model are based on [48]. The observed states correspond to a
pair of continuous variables: the position p(n) and velocity v(n) of the car. The
values are restricted to the intervals −1.2 ≤ p(n) ≤ 0.5 and −0.07 ≤ v(n) ≤ 0.07 for all
Figure 6-1. The Mountain-car task.
time n. The mountain altitude is sin(3p), and the state evolution dynamics are given by
v(n + 1) = v(n) + 0.001a(n)− g cos(3p(n)) (6–1)
p(n + 1) = p(n) + v(n + 1) (6–2)
where g represents gravity (g = 0.0025), and a(n) is a chosen action at time n. There
are 3 possible actions: accelerate backwards a = −1, coast a = 0, and accelerate
forward a = +1. At each time step, reward r = −1 is assigned, and once the updated
position p(n + 1) exceeds 0.5, the trial terminates.
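A minimal sketch of one transition of these dynamics is given below; the function name and return convention are illustrative, and enforcing the stated position and velocity restrictions by clipping is an assumption:

```python
import numpy as np

def mountain_car_step(p, v, a, g=0.0025):
    """One transition of the mountain-car dynamics, Eqs. (6-1)-(6-2).
    a is -1 (accelerate backward), 0 (coast), or +1 (accelerate forward)."""
    v_next = v + 0.001 * a - g * np.cos(3.0 * p)
    v_next = np.clip(v_next, -0.07, 0.07)   # velocity restricted to [-0.07, 0.07]
    p_next = p + v_next
    p_next = np.clip(p_next, -1.2, 0.5)     # position restricted to [-1.2, 0.5]
    reward = -1.0                           # reward of -1 at each time step
    done = p_next >= 0.5                    # trial terminates when the position reaches 0.5
    return p_next, v_next, reward, done
```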
We run 30 trials to learn the policy. At each trial, the initial states are drawn
randomly from −1.2 ≤ p ≤ 0.5 and −0.07 ≤ v ≤ 0.07. The system is initialized when the
first trial starts, and each trial allows a maximum of 10^4 steps. At each trial,
the number of steps is counted, and it is averaged over the 30 trials and 50 Monte Carlo
runs. For each Monte Carlo run, the same set of 30 initial values is used. In addition, for
the ϵ-greedy method, we apply an exploration rate ϵ = 0.05.
First, we apply Q-TD(λ) to find the state action map, and the performances of
different combinations of parameters (λ = 0, 0.2, 0.4, 0.6, 0.8, 1 and η = 0.1, 0.3, 0.5, 0.7, 0.9)
are observed (Figure 6-2). In general, as λ gets larger, the performance worsens.
The large mean and standard deviation appear when the car gets stuck in the valley
(p ≈ −0.5), so it fails to reach the goal within the maximum step limit of 10^4. Note that
in this task, the input state space is continuous, so there are an infinite number of
states, and using the position-velocity representation certainly does not fulfill the linear
independence criterion.
Figure 6-2. Performance of Q-TD(λ) with various combinations of λ and η.
Attempts to make Q-TD(λ) applicable in continuous input space by discretizing
the state space are usually considered. For instance, placing overlapping tiles to
partition the input space, a process called “tile coding,” is a usual approach to provide a
representation that would be expected to do a better job. Examples where we can see
the performance of TD including this preprocessing method can be found in [16, 48].
However, proper state representations are difficult to obtain because they require prior
information about the state space. It is here where we believe Q-KTD(λ) can provide an
advantage.
For Q-KTD(λ), we employ the Gaussian kernel (4–3). From the data used in the
Q-TD(λ) application, a kernel size h = 0.2 is observed to be adequate, which is close to the
heuristic that uses the distribution of squared distances between pairs of input states. To confirm the usefulness
of these values, we apply different kernel sizes (h = 0.01, 0.05, 0.1, 0.2, 0.3, 0.4), and the
mean number of steps per trial is observed. This mean is the average over 30 trials and
50 Monte Carlo runs. For this evaluation, we fix λ = 0.4 and η = 0.5.
Kernel size h = 0.05 shows the lowest mean number of steps per trial, but
performances are not significantly different over a broader range of parameter values
that includes h = 0.2, which is the largest kernel size that exhibits good performance.
Figure 6-3. The performance of Q-KTD(λ) with respect to different kernel sizes.
Again, a preference for a larger kernel size is motivated by the smoothness assumption.
The performance of Q-KTD(λ) with different combinations of λ and η is observed.
Here, the same combinations as in Figure 6-2 are tested.
Figure 6-4. Performance of Q-KTD(λ) with various combinations of λ and η.
Based on Figures 6-2, 6-3, and 6-4, the optimal parameters for Q-TD and Q-KTD can
be obtained (λ = 0.4 and η = 0.5 for Q-TD; λ = 0, η = 0.3, and h = 0.2 for Q-KTD).
With the selected parameters, we further compare the performances of Q-TD(λ) and
Q-KTD(λ). First, at each trial, we count the number of iterations until the car reaches the
goal, and then we average the number of iterations per trial over 30 trials and 50 Monte
Carlo runs. Figure 6-5 shows the relative frequency with respect to the average number
of iterations per trial. For better understanding, Figure 6-6 plots the average number
of iterations per trial with respect to the trial number. Note that the x-axis of Figure
6-5 corresponds to the y-axis of Figure 6-6. The results show that both Q-TD(λ) and
Figure 6-5. Relative frequency with respect to average number of iterations per trial ofQ-TD(λ) and Q-KTD(λ).
Figure 6-6. Average number of iterations per trial of Q-TD(λ) and Q-KTD(λ).
Q-KTD(λ) are able to find a proper policy. However, compared to Q-TD(λ), Q-KTD(λ)
works better for policy improvement: Q-KTD(λ) has more trials with a smaller number of
iterations (Figure 6-6). In addition, the large number of iterations in Figure 6-5 is due to
exploration at the initial stage of learning.
In the state value estimation problems (Boyan chain experiments in the previous
chapters), we have seen the robustness of the maximum correntropy criterion (MCC) under
different types of perturbation on the policy or environment, namely, the noisy reward
and state transition probability. Here, we will see the usefulness of correntropy for
learning under switching policies, which can be the case when implementing an
exploration/exploitation trade-off in reinforcement learning. TD algorithms integrated with
correntropy can provide better performance under such learning scenarios.
When we try to obtain a good policy without any prior knowledge of how the optimal
policy should be, the system is required to learn by exploring the environment. Thus,
at each time, the system observes certain state to action maps from experience, and
the system needs to evaluate the given policy to update the functional mapping; that is,
it is essential that the system is able to learn under changing policies. Therefore, here
we will observe how the proposed algorithms can efficiently learn a good policy while
constantly changing policies during the learning process.
We use the Mountain-car task and vary the exploration rate to confirm how the
system learns under a changing policy. We start with a totally random policy (100%
exploration rate, ϵ = 1). This exploration rate is kept until the 200th step, and then we switch
to ϵ = 0. When the exploration rate ϵ is 0, the observed performance shows exactly what
the system has been able to learn from random exploration. In addition, further steps
are allowed to let the system adjust its current estimate of the policy.
By keeping the optimal parameters of Q-KTD η = 0.3 and h = 0.2, we examine
Q-CKTD with different correntropy kernel sizes (hc = 1, 2, 3, 4, 5, 10, 50). Figure 6-7
shows the average number of steps per trial over the 30 trials and 50 Monte Carlo
runs. In the case of hc = 3, the Q-CKTD results show a mean and standard deviation
of 349.8713 ± 368.0790, whereas Q-KTD shows 558.5773 ± 1012.3 with the optimal
parameters. This observation reveals the positive effect that robustness of correntropy
as a cost function brings to learning under changing policies. For better understanding,
we further observe the average step number at each trial over 50 Monte Carlo runs
(Figure 6-8). Note that the same 30 initial states are applied for the 50 Monte Carlo runs.
Q-CKTD takes a larger number of steps at the beginning, but as learning progresses
(trial number increases), it requires significantly fewer steps per trial. We can also see
that the system adapts to the environment and is able to find a better policy. Note that
Figure 6-7. The performance of Q-CKTD with different correntropy kernel sizes.
Figure 6-8. Average number of steps per trial of Q-KTD and Q-CKTD.
until the 200th step, the policy is completely random, and thus, both Q-KTD and Q-CKTD
show an average number of steps larger than 200. The trials that reach the goal even
under a random policy are able to do so because their initial positions are close enough
to the goal.
6.2 Two Dimensional Spatial Navigation Task
We have observed the benefits of using the kernel based representations in practical
applications. Before applying Q-KTD(λ) and Q-CKTD to neural decoding in brain
machine interfaces, we present some results on a simulated, 2-dimensional spatial
navigation task. This simulation will provide insights about how the system will perform
in further practical experiments. This simulation shares some similarities with the
neural decoding experiment; based on the input states, the system predicts which
direction it should follow, and depending on the updated position, the next input states
are provided. The goal is to reach a target area where a positive reward is assigned.
No prior information of the environment is given; the system is required to explore the
environment to reach the target.
This simulation is a modified version of the maze problem in [14]. In our case there
is a 2-dimensional state space that corresponds to a square with a side-length of 20
units. The goal is to navigate from any position on the square to a target located within
the square. In our experiments, one target is located at the center of the square (10, 10),
and reaching any point within a 2 unit radius of the target is considered successful. A fixed set
of 25 points distributed in a lattice configuration are taken as initial seeds for random
initial states. Each initial state corresponds to drawing randomly one of these 25 points
with equal probability. The location of the selected point is further perturbed with unit
variance, zero mean additive Gaussian noise, G(0, 1). To navigate through the state
space, we can choose to move 3 units of length in one of the 8 possible directions that
are allowed. The maximum number of steps per trial is limited to 20.
The agent gets a reward +0.6 every time it reaches the goal, and then a new trial
starts. Otherwise, a reward −0.6 is given. Exploration rate of ϵ = 0.05 and discount
factor γ = 0.9 are used. The kernel employed is the Gaussian kernel with size h = 4.
This kernel size is selected based on the distribution of squared distance between pairs
of input states.
To assess the performance, we count the number of trials which earned the positive
reward within a group of 25 trials; that is, every 25 trials, we calculate the success rate
of the learned mapping as (# of successful trials)/25. To help in understanding the
behavior and illustrating the role of the parameters, with a fixed kernel size of 4, the
performance over various stepsizes (η = 0.01, 0.1 ∼ 0.9 with 0.1 intervals) and
values of the eligibility trace rate (λ = 0, 0.2, 0.5, 0.8, 1) is shown in Figure 6-9.
Figure 6-9. The average success rates over 125 trials and 50 implementations.
The stepsize η mainly affects the speed of learning, and within the stability limits,
larger stepsizes provide faster convergence. However, due to the effect of eligibility trace
rate λ, the stability limits suggested in [28] must be adjusted accordingly
\eta < \frac{N}{\operatorname{tr}[G_{\phi}]} = \frac{N}{\sum_{j=1}^{N} (\lambda^{0} + \cdots + \lambda^{m-1})\, \kappa(x(j), x(j))} \leq \frac{N}{\sum_{j=1}^{N} \kappa(x(j), x(j))}. (6–3)
This upper bound assumes that the maximum number of steps per trial m has been
reached, and in the case of the Gaussian kernel, the bound becomes 1/(\lambda^{0} + \cdots + \lambda^{m-1}).
Hence, for larger λ values, the stable stepsizes η lie in a smaller interval; for λ =
0, 0.2, 0.5, 0.8, 1, the stable stepsizes η lie below 1, 0.8, 0.5, 0.2, 0.05, respectively.
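For the Gaussian kernel, κ(x, x) = 1, so the bound reduces to 1/(λ^0 + · · · + λ^{m−1}). A small illustrative check with m = 20 steps per trial (the function name is hypothetical) roughly reproduces the limits quoted above:

```python
def stepsize_bound(lam, m=20):
    """Upper bound on eta from Eq. (6-3) for the Gaussian kernel (kappa(x,x) = 1):
    eta < 1 / (lam^0 + lam^1 + ... + lam^(m-1))."""
    return 1.0 / sum(lam ** k for k in range(m))

# Approximately 1, 0.8, 0.5, 0.2, 0.05 for lambda = 0, 0.2, 0.5, 0.8, 1:
print([round(stepsize_bound(lam), 3) for lam in (0, 0.2, 0.5, 0.8, 1)])
```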
The trade-off between λ and η is observed in Figure 6-9. It is clear how these
parameters can be associated with the speed of learning. At intermediate points on the
horizontal axis, the influence of λ becomes relevant as the stepsize decreases. On the
other hand, if the stepsize increases, we can see how performance degrades for larger
values of λ since the bound set by (6–3) is not satisfied.
The relation between the final filter size and the stepsize and eligibility trace rate is
also plotted in Figure 6-10. Since each trial allows a maximum of 20 steps, the
largest possible final filter size is 2500. However, with a good adaptive system, the final filter size
can be reduced. The final filter size is inversely related to the success rates (Figures
6-9 and 6-10). High success rates mean that a system has learned the state-action
mapping, whereas a system that has not adapted to the new environment keeps exploring
the space. Therefore, high success rates will correspond to small filter sizes and vice
versa.
Figure 6-10. The average final filter sizes over 125 trials and 50 implementations.
Both average success rate and final filter size show that η = 0.9 and λ = 0 have
the best performance. With the selected parameters, the success rates reach
over 95% after 100 trials. From Figure 6-11, we can observe how learning is
accomplished. At the beginning, the system explores more of the space based on
the reward information, and the trajectories look rather erratic. Once the system starts
learning, actions corresponding to states near the target point toward the reward zone,
and as time goes by this area becomes larger and larger until it covers the whole state
space.
The blue stars represent the 25 initial states, and the green arrows show the action
chosen at each state. The red dot at the center is the target, and the red circle shows the reward
zone.
Figure 6-11. Two dimensional state transitions of the first, third, and fifth sets with η = 0.9 and λ = 0.
Kernel methods are powerful for solving nonlinear problems, but the growing
computational complexity and memory size limit their applicability in practical scenarios.
To overcome this, we also show how the quantization approach presented in [9] can be
employed to ameliorate the limitations imposed by growing filter sizes. For a fixed set
of 125 inputs, we consider quantization sizes ϵU = 40, 30, 20, 10, 5, 2, 1. Figure 6-12
shows the effect of different quantization sizes on the final performance. Notice that the
minimum size for stable performance of the filter is reached at approximately 60
units. Therefore, a quantization size ϵU can be selected such that the maximum success
rate is still achieved (see Figure 6-13).
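A minimal sketch of the quantization rule, in the spirit of [9], is given below: a new input is added as a center only if it lies farther than ϵU from every existing center; otherwise the coefficient update is absorbed by the nearest center. The names and the merging convention are illustrative:

```python
import numpy as np

def quantized_update(centers, coeffs, x, delta_alpha, eps_u):
    """Sketch of the quantization rule used to limit filter growth (after [9])."""
    if centers:
        dists = [np.linalg.norm(x - c) for c in centers]
        j = int(np.argmin(dists))
        if dists[j] <= eps_u:
            coeffs[j] = coeffs[j] + delta_alpha   # absorb the update, no new unit
            return centers, coeffs
    centers.append(np.asarray(x, dtype=float))    # otherwise grow the dictionary
    coeffs.append(delta_alpha)
    return centers, coeffs
```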
Figure 6-12. The average success rates over 125 trials and 50 implementations with respect to different filter sizes.
Figure 6-13. The change of success rates (top) and final filter size (bottom) with ϵU = 5.
Let us now compare the performance of Q-KTD and Q-CKTD. Based on
Figures 6-9 and 6-10, we select λ = 0, η = 0.9, and kernel size h = 4 for both algorithms.
In the case of Q-CKTD, the correntropy kernel size hc = 10 is selected by visual
inspection. Table 6-1 shows the average success rate of Q-KTD and Q-CKTD. Note that
the average values correspond to 50 Monte Carlo runs using 125 trials per run.
Table 6-1. The average success rate of Q-KTD and Q-CKTD.
          mean      standard deviation
Q-KTD     0.7019    0.0674
Q-CKTD    0.7248    0.0455
As we can see, the Q-CKTD algorithm shows a higher average success rate as
well as a smaller variance among runs. Figure 6-14 depicts the evolution of the average
success rates along with their variance estimates across 50 Monte Carlo runs. Every
25 trials, we count the number of trials that earned positive reward within the 25 trial
interval. Both algorithms, Q-KTD and Q-CKTD, show similar performance at the very
beginning. However, as the number of trials increases, Q-CKTD displays higher average
success rates than Q-KTD. These differences are more noticeable at the 50th and
100th trials. In addition, it is also important to highlight the behavior of the standard
deviation for Q-CKTD, which decreases much faster than that of Q-KTD as the number of trials
increases. These results show that the robustness of the correntropy criterion as a cost
function can help in learning the policy.
Figure 6-14. The change of average success rates by Q-KTD and Q-CKTD.
In this chapter, we tested Q-KTD(λ) and Q-CKTD on synthetic experiments to
find a good policy based on the approximation of the action value function Q in RL.
We saw that Q-KTD(λ) provided stable performance in continuous state spaces and
found good state to action mappings. In addition, we observed that the robust nature
of Q-CKTD helped improve performance under changing policies. Experimental results
also provided insights on how to perform parameter selection. In addition, we showed
how the quantization approach could be successfully applied to control the growing filter
size. The results showed that the method was able to find good policies and could be
implemented in more realistic scenarios.
CHAPTER 7
PRACTICAL IMPLEMENTATIONS
In the previous chapters, we used the Boyan chain problem to elucidate the
properties of the different proposed algorithms when estimating state value functions.
We observed both linear and nonlinear approximation capabilities in KTD(λ). Given the appropriate
kernel size, KTD should be able to approximate both linear and nonlinear functions. In
addition, the Mountain-car and 2-dimensional spatial navigation experiments showed
the advantages of Q-KTD(λ) in continuous state spaces where the number of states
is essentially infinite. The use of kernels allows arbitrary input spaces and works with
little prior knowledge of policy. Q-KTD(λ) is a simple yet powerful algorithm to solve RL
problems. Our ultimate goal is to show that KTD(λ) can work in more realistic scenarios.
To illustrate this, we present a relevant signal processing application in brain machine
interfaces.
In our RLBMI experiments, we use a monkey's neural signal to select an action
direction (computer cursor position / robot arm position). The agent starts in a naive
state, but the subject has been trained to receive rewards from the environment. Once
it reaches the assigned target, the system and the subject earn a reward, and the agent
updates its decoder of brain activity. Through iteration, the agent learns how to correctly
translate neural states into action-direction.
7.1 Open Loop Reinforcement Learning Brain Machine Interface: Q-KTD(λ)
We first apply the neural decoder in open loop RLBMI experiments; the algorithm
learns based on the monkey’s neural states to find a proper mapping to actions while
the monkey is conducting a goal reaching task. However, the output of the agent does
not directly change the state of the environment because this is done with pre-recorded
data. The external device is updated based only on the actual monkey’s physical
response. Thus, if the monkey conducts the task properly, the external device reaches
the goal. In this sense, we only consider the monkey’s neural state from successful trials
to train the agent. The goal of this experiment is to evaluate the system’s capability to
predict the proper state to action mapping based on the monkey’s neural states and to
assess the viability of further closed loop RLBMI experiments.
7.1.1 Environment
The data employed in these experiments is provided by SUNY Downstate Medical
Center. A female bonnet macaque is trained for a center-out reaching task allowing
8 action directions. After the subject attains about 80% success rate, micro-electrode
arrays are implanted in the motor cortex (M1). Animal surgery is performed under the
Institutional Animal Care and Use Committee (IACUC) regulations and assisted by the
Division of Laboratory Animal Resources (DLAT) at SUNY Downstate Medical Center. A
set of 185 units are obtained after sorting from 96 channels, and the firing times of these
units are the ones used for the neural decoding; the neural states are represented by the
firing rates on a 100ms window.
There is a set of 8 possible targets and 8 possible action directions. Every
trial starts at the center point, and the distance from the center to each target is
4cm; anything within a radius of 1cm from the target point is considered a valid
reach (Figure 7-1).
Figure 7-1. The center-out reaching task for 8 targets.
7.1.2 Agent
In the agent, Q-learning via kernel temporal difference (Q-KTD)(λ) is applied to
neural decoding. After the neural states are preprocessed by normalizing their dynamic
range to lie between −1 and 1, they are input to the system. Based on the preprocessed
neural states, the system predicts the direction in which the computer cursor will be updated.
Each output unit represents one of the 8 possible directions, and among the 8 outputs
one action is selected by the ϵ-greedy method [56]. The performance is evaluated by
checking whether the updated position reaches the assigned target, and depending on
the updated position, a reward value will be assigned to the system.
7.1.3 Center-out Reaching Task - Single Step
First, we observe the behavior of the algorithms on a single step reaching task.
This means that rewards from the environment are received after a single step and one
action is performed by the agent per trial. The assignment of reward is based on the
1− 0 distance to the target, that is, dist(x , d) = 0 if x = d , and dist(x , d) = 1, otherwise.
Once the cursor reaches the assigned target, the agent gets a positive reward (+0.6),
otherwise it receives negative reward (−0.6) [41]. Based on the selected action with
exploration rate ϵ = 0.01, and the assigned reward value, the system is adapted as in
Q-learning via kernel TD(λ) with γ = 0.9. In our case, we can consider λ = 0 since our
experiment performs single step updates per trial.
In this experiment, the firing rates of the 185 units on 100ms windows are time
embedded using a 6th order tap delay; this creates a representation space where each
state is a vector with 1295 dimensions.
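As a sketch of this preprocessing, a 6th order tap delay stacks the current firing-rate vector with the 6 previous ones, giving 185 × 7 = 1295 dimensions; zero padding of the initial history is an assumption of this illustration:

```python
import numpy as np

def tap_delay_embed(rates, order=6):
    """Stack the current firing-rate vector with its `order` previous vectors.
    `rates` has shape (T, n_units); the output has shape (T, n_units * (order + 1)).
    With n_units = 185 and order = 6, this gives 185 * 7 = 1295 dimensions."""
    T, n_units = rates.shape
    padded = np.vstack([np.zeros((order, n_units)), rates])   # zero history before the first bin
    return np.hstack([padded[order - d : order - d + T] for d in range(order + 1)])
```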
The simplest version of the problem limits the number of targets to 2 (right and left),
and the targets should be reached within a single step. The time delayed neural net
(TDNN) has already been applied to RLBMI experiments, and its applicability in neural
decoding has been validated in [13, 31]. Thus, the performance of the Q-KTD algorithm
is compared with a TDNN as a mapper. The total number of trials is 43 for the 2 targets.
For Q-KTD, we employ the Gaussian kernel (4–3), and the kernel size h is heuristically
chosen based on the distribution of the mean squared distance between pairs of input
states; let s = E[\|x_i - x_j\|^2], then h = \sqrt{s/2}. For this particular dataset, this
heuristic gives a kernel size h = 7. The stepsize η = 0.3 is selected based on the
stability bound that was derived for KLMS [28],
\eta < \frac{N}{\operatorname{tr}[G_{\phi}]} = \frac{N}{\sum_{j=1}^{N} \kappa(x(j), x(j))} = 1. (7–1)
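The two selections just described can be summarized in a small illustrative sketch; the helper names are hypothetical, and the pairwise mean includes self-pairs for simplicity:

```python
import numpy as np

def kernel_size_heuristic(X):
    """h = sqrt(s/2), where s is the mean squared distance between pairs of input states
    (rows of X); self-pairs are included for simplicity in this rough estimate."""
    diffs = X[:, None, :] - X[None, :, :]
    s = np.mean(np.sum(diffs ** 2, axis=-1))
    return np.sqrt(s / 2.0)

def klms_stepsize_bound(N, kappa_xx=1.0):
    """Eq. (7-1): eta < N / sum_j kappa(x(j), x(j)); for the Gaussian kernel kappa(x, x) = 1,
    so the bound equals 1."""
    return N / (N * kappa_xx)
```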
The initial TD error is set to zero, and the first input vector is assigned as the first
unit’s center. After 43 trials, we count the number of trials which received a positive
reward, and the success rate is averaged over 50 Monte Carlo runs. Figure 7-2 shows
the average learning curves of Q-learning via KTD(0) and the TDNN.
Figure 7-2. The comparison of average learning curves from 50 Monte Carlo runs between Q-KTD(0) and the MLP.
KTD(0) reaches around 100% success rate after 2 epochs. In contrast, the average
success rate of TDNN slowly increases yet never reaches the same performance as
KTD. The solid line shows the mean success rates and the dashed line shows the
confidence interval based on the standard deviation. Since all the parameters are fixed
over 50 Monte Carlo runs, the confidence interval for KTD(0) can be simply associated
with the random effects introduced by the ϵ-greedy method employed for action selection
with exploration; thus, the narrow interval. However, with the TDNN a larger variation
of performance is observed, which shows how the initialization, due to local minima,
influences the success of learning; it is observed that the TDNN is able to approximate
the KTD performance, but most of the time, the system is stuck on local minima. From
this result, we can highlight one of the advantages of KTD(0) compared to MLPs, which
is the insensitivity to initialization.
However, one apparent disadvantage of using a nonparametric approach such
as KTD is the growing filter structure; the filter size increases linearly with the input
data, which is a prohibitive constraint in an online scenario. Therefore, methods for
controlling the growth of the filter are necessary; fortunately, there exist methods to
avoid this problem, such as the surprise measure [25] or the quantization approach [9],
which are incorporated in our algorithm for the 2 target center-out reaching tasks.
Without controlling the filter size, the success rates reach around 100% within 3 epochs,
but within only 20 epochs, the filter size becomes as large as 861 units. Using the
surprise measure [25], the filter size can be reduced to 87 centers with acceptable
performance. However, the quantization method [9] allows the filter size to be reduced to
10 units while keeping performance above a 90% success rate. Therefore, more
experiments applying the quantization approach are
conducted. Figure 7-3 shows the effect of filter size in the 2 target experiment.
Figure 7-3. The average success rates over 20 epochs and 50 Monte Carlo runs with respect to different filter sizes.
For filter sizes as small as 10 units, the average success rates remain stable. Thus,
filter size 10 can be chosen when efficient computation is necessary. Figure 7-4 shows
the learning curves corresponding to different filter sizes in comparison with TDNN.
The average success rates are computed over 50 Monte Carlo runs.
Figure 7-4. The comparison of KTD(0) with different final filter sizes and TDNN with 10 hidden units.
As we pointed out, in the case of total filter size of 10 (red line), the algorithm shows
almost the same learning speed as the linearly growing filter size, with success rates
above 90%. When we compare the average learning curves to TDNN, even a filter with 3
units (magenta line) using KTD(0) performs better than TDNN.
In the 2 target single step center out reaching task, Q-KTD(0) showed promising
results solving the initialization and growing filter size issues. Further analysis of
Q-KTD(0) is conducted on a more difficult task involving a larger number of targets.
All the experimental values are kept fixed using the same setup as in the above
experiments. The only changes are the number of targets, from 2 to 8 (1 ∼ 8), and the
stepsize, η = 0.5.
Since the total number of trials is 178 in this experiment, without any mechanism to
control the filter size, the filter structure can grow up to 1780 units within 10 epochs. The
quantization approach [9] is again applied to reduce the filter size. Intuitively, there is an
intrinsic relation between quantization size ϵU and kernel size h. Consequently, based on
the distribution of squared distance between pairs of input states, various kernel sizes
(h = 0.5, 1, 1.5, 2, 3, 5, 7) and quantization sizes (ϵU = 1, 110, 120, 130) are tested. The
corresponding success rates for final filter sizes of 178, 133, 87, and 32 are displayed in
Figure 7-5.
Figure 7-5. The effect of filter size control on the 8-target single-step center-out reaching task. The average success rates are computed over 50 Monte Carlo runs after the 10th epoch.
Again, since all the parameters are fixed over the 50 Monte Carlo runs, the narrow
error bars are due to the random action selection for exploration, and this small variation
supports the claim that this kernel approach does not depend heavily on initialization, unlike
conventional TD learning algorithms based on neural networks. With a final filter size of
178 (blue line), the success rates are superior to those of any other filter size for every kernel
size tested, since it contains all the input information. Especially for small kernel
sizes (h ≤ 2), success rates above 96% are observed. Moreover, note that even
after reduction of the state information (red line), the system still produces acceptable
success rates for kernel sizes ranging from 0.5 to 2 (around 90% success rates).
Intuitively, the largest kernel sizes that provide good performance are better for
generalization; in this sense, a kernel size h = 2 is selected since this is the largest
kernel size that considerably reduces the filter size and yields a neural state to action
mapping that performs well (around a 90% success rate). In the case of kernel size
h = 2 with final filter size of 178, the system reaches 100% success rates after 6 epochs
with a maximum variance of 4% (Figure 7-6). To observe the learning process, success
rates are calculated after each epoch (1 epoch contains 178 trials).
Figure 7-6. The average success rates for various filter sizes.
The 8-target experiment shows the effect of the filter size, and how it converges
after 6 epochs (Figure 7-6). As we can see from the number of units in both cases,
higher representation capacity is required to obtain the desired performance as the
task becomes more complex (Figures 7-4 and 7-6). The results of the algorithm on the
8-target center-out reaching task show that the method can effectively learn the
brain state to action mapping for this task and remains feasible.
7.1.4 Center-out Reaching Task - Multi-Step
Here, we want to develop a more realistic scenario. Therefore, we extend the task
to multi-step and multi-target experiments. This case allows us to explore the role of
the eligibility traces in Q-KTD(λ). The price paid for this extension is that now the
selection of λ, with 0 < λ < 1, needs to be carried out according to the best observed performance.
Testing based on the same experimental setup as with the single step task, that is, a
discrete reward value is assigned at the target, causes extremely slow learning since
no guidance is given. The system requires long periods of exploration until it actually
reaches the target. Therefore, we employ a continuous reward distribution around the
selected target defined by the following expression:
r(s) = \begin{cases} p_{reward}\, G(s) & \text{if } G(s) > 0.1, \\ n_{reward} & \text{if } G(s) \leq 0.1, \end{cases} \qquad \text{where } G(s) = \exp\left[ -(s - \mu)^{\top} C_{\theta}^{-1} (s - \mu) \right] (7–2)

where s ∈ R^2 is the position of the cursor, p_reward = 1, and n_reward = −0.6. The mean
vector µ corresponds to the selected target location, and the covariance matrix

C_{\theta} = R_{\theta} \begin{pmatrix} 7.5 & 0 \\ 0 & 0.1 \end{pmatrix} R_{\theta}^{\top}, \quad \text{where } R_{\theta} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix},

depends on the angle θ of the selected target as follows: for targets one and five the
angle is 0, two and six −π/4, three and seven π/2, and four and eight π/4. Figure 7-7
shows the reward distribution for target one.
Figure 7-7. Reward distribution for right target.
The same form of distribution is applied to the other directions centered at the
assigned target point. The black diamond is the initial position, and the purple diamond
shows the possible directions including the assigned target direction (red diamond).
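A small illustrative computation of this reward model, following the reconstructed expression above (a Gaussian-shaped bump that decays away from the target), could look as follows; the function name and arguments are hypothetical:

```python
import numpy as np

def directional_reward(s, target, theta, p_reward=1.0, n_reward=-0.6):
    """Sketch of the reward model (7-2): a Gaussian-shaped reward elongated along the
    target direction theta; positions with G(s) <= 0.1 receive the negative reward."""
    R = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    C = R @ np.diag([7.5, 0.1]) @ R.T                      # covariance C_theta
    d = np.asarray(s, dtype=float) - np.asarray(target, dtype=float)
    G = np.exp(-d @ np.linalg.solve(C, d))                 # Gaussian bump, G = 1 at the target
    return p_reward * G if G > 0.1 else n_reward
```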
Once the system reaches the assigned target, the system earns a maximum reward
of +1, and receives partial rewards according to (7–2) during the approaching stage.
When the system earns the maximum reward, the trial is classified as a successful trial.
The maximum number of steps per trial is limited such that the cursor must approach
the target on a straight line trajectory. Here, we also control the complexity of the task
by allowing different number of targets and steps. Namely, 2-step 4-target (right, up,
left, and down); and 4-step 3-target (right, up, and down) experiments are performed.
Increasing the number of steps per trial amounts to making smaller jumps according
to each action. After each epoch, the number of successful trials are counted for each
target direction. Figure 7-8 shows the learning curves for each target and the average
success rates.
A 2-step 4-target B 4-step 3-target
Figure 7-8. The learning curves for multi step multi target tasks.
A larger number of steps results in lower success rates. However, both cases
(two and four steps) obtain an average success rate above 60% in the first epoch. This result
suggests that the algorithms could be applied in online scenarios. The performance
shows that all directions can achieve success rates above 70% after convergence.
7.2 Open Loop Reinforcement Learning Brain Machine Interface: Q-CKTD
We have already seen the performance of Q-KTD(λ) in finding an optimal neural to
motor mapping. In this section, we want to compare the performance of Q-learning
via KTD(λ) and CKTD. Both algorithms are applied to passive data on a center-out
reaching task aiming at 4 targets (right, up, left, and down). The difference between
the passive data and the data employed in the previous section is that for the passive
data the monkey does not perform any movement; it only observes how the position
of a cursor changes over time. Neural states are recorded while the monkey watches
the screen changing through the duration of the experiment. Spike times from 49 units
are converted to firing rates using a 100ms window, and a 9th order tap delay line is
applied to the input; hence, 490 dimensions are used to represent the neural states. The
total number of trials is 144; each trial is initialized at the center and allows 2 steps
to approach the target. The distance between the initial point (center) and the target can
be covered in 1 step. A trial is terminated once it exceeds 2 steps or receives the positive
reward +1.5. Here, the positive reward value is assigned when the cursor reaches the
reward zone (0.2 distance from an assigned target). Otherwise, it earns negative reward
−0.6. The discount factor is γ = 0.9, the exploration rate ϵ = 0.01, and stepsize η = 0.5.
In these experiments, we do not apply any filter size control. The kernel size for
KTD is chosen based on the distribution of squared distances between pairs of input
states, resulting in h = 0.8. When we fix the filter kernel size to h = 0.8 and apply
Q-CKTD, there is no significant difference between Q-KTD and Q-CKTD. However, by
changing the filter kernel size to 1, Q-CKTD shows improvement over KTD. Here, the
correntropy kernel size is hc = 1.
The success rates at each epoch are obtained as the average number of successful
trials over the 4 targets. These success rates are further estimated by averaging over 50
Monte Carlo runs; the results are displayed in Figure 7-9.
The average success rates between Q-KTD with filter kernel size h = 0.8 and
Q-CKTD with filter kernel size h = 1 and correntropy kernel size hc = 1 are compared
(Figure 7-9 (a)). CKTD shows improved success rates for the 1st and 2nd epochs.
However, the success rates after the 3rd epoch remain essentially equal to those for
Kernel TD. Since correntropy KTD weights are a combination of the error values e with
the level of importance based on the κ(e), and the error distribution changes during
A Correntropy KTD with fixed correntropy kernel size 1.
B Correntropy KTD with reduced kernel size from 1 to 0.8 at the 3rd epoch.
Figure 7-9. Average success rates over 50 runs. The solid line shows the mean success rates, and the dashed line shows the standard deviation.
the learning process, it is reasonable to assume that the size of the correntropy kernel
may need to be adjusted as learning progresses. A principled method to select the
correntropy kernel size is still under development; however, we chose to manually set
changes in the correntropy kernel size by observing the evolution of the errors. The
correntropy kernel size hc is reduced from 1 to 0.8 at the 3rd epoch. As we predicted,
an improvement in success rates is observed at the 3rd and 4th epochs (Figure 7-9 (b)).
This motivates further work on fine tuning the correntropy kernel size, and some future
effort will be devoted to this issue.
To understand better the properties of Q-KTD and Q-CKTD, we observe the
behavior of other quantities such as the actual predictions of the Q-values and individual
success rates according to each target. These quantities are observed by employing the
same parameter set as in Figure 7-9 (b), but the results are obtained from a single run.
First, the Q-value changes are observed at each trial (Figure 7-10). Correntropy
KTD has slower convergence in Q-values than Kernel TD. However, correntropy KTD
shows higher success rates over time. In addition, when we check the Q-value changes
at the 1st epoch (Figure 7-11 (a) and (b)), correntropy KTD has higher values, and it
attempts to explore more directions during learning. Since the positive reward is 1.5 and
the Q-value represents the expected reward given a state and action, it is desirable for the
value predictor to converge to 1.5.
A Kernel TD.
B Correntropy KTD.
Figure 7-10. Q-value changes per trial during 10 epochs.
Although the Q-values predicted by Kernel TD are
closer to the positive reward 1.5 (Figure 7-11 (c) and (d)), the variance of the Q-value
does not affect the success rates. This leaves an open question about what properties
of correntropy may be involved in this behavior, and it becomes an important reason to
carry out further analysis in order to fully understand the algorithm.
The success rate of each target is observed from the 1st to the 5th epoch (Figure 7-12).
Target indices 1, 3, 5, and 7 represent right, up, left, and down, respectively. When we
apply Kernel TD, at the beginning, learning tends to focus on certain
directions; during the first epoch the agent mainly learns the down direction (target
index 7), and during the second epoch the learning inclines towards the left direction
(target index 5) (Figure 7-12(a)). However, the learning variation over each direction in
correntropy KTD is smaller in comparison with Kernel TD (Figure 7-12 (b)).
7.3 Closed Loop Brain Machine Interface Reinforcement Learning
Q-KTD(λ) has been tested on open loop RLBMI experiments, and we have seen
that the algorithm performs well in that setting. Therefore, the
application has progressed to closed loop RLBMI experiments. In closed loop RLBMI
experiments, the agent is trained to find a mapping from the monkey’s neural states
A At 1st epoch by KTD.
B At 1st epoch by Correntropy KTD.
C At 10th epoch by KTD.
1300 1320 1340 1360 1380 1400 1420 14400
1
2
3
4
5
6
7
8
Targ
et
Index
Trial Numbers
1300 1320 1340 1360 1380 1400 1420 1440−0.5
0
0.5
1
1.5
2
Q−
va
lues
Trial Numbers
Target1
Target2
Target3
Target4
Target5
Target6
Target7
Target8
D At 10th epoch by Correntropy KTD.
Figure 7-11. Target index and matching Q-values.
to a robot arm position. The monkey has been trained to associate its neural states
with a particular task goal. The behavior task is a reaching task using a robotic arm, in
which the decoder controls the robot arm’s action direction by predicting the monkey’s
intent based on its neuronal activity. If the robot arm reaches the assigned target, a reward is given to both the monkey (a food reward) and the decoder (a positive value).
Notice that the two intelligent systems learn co-adaptively to accomplish the goal. These
experiments are conducted in cooperation with the Neuroprosthetics Research Group
at the University of Miami. The performance is evaluated in terms of task completion
accuracy and speed. Furthermore, we attempt to evaluate the individual performance of
each one of the systems in the RLBMI.
7.3.1 Environment
During pre-training, a marmoset monkey has been trained to perform a target
reaching task aimed at two spatial locations (A or B trial); the monkey was taught to
associate changes in motor activity during A trials, and produce static motor responses
[Figure: success rate per target index (1, 3, 5, 7) for the 1st through 5th epochs: (A) KTD, (B) Correntropy KTD.]
Figure 7-12. The success rates of each target over 1 through 5 epochs.
during B trials. When one target is assigned, the trial starts with a beep. To conduct the trial during the user training phase, the monkey is required to hold its hand steadily on a touch pad for 700–1200 ms. This action produces a go beep, followed by one of the two target LEDs being lit (A trial: red light for the left direction; B trial: green light for the right direction). The robot arm moves up to the home position, namely the center position between the two targets. Its gripper presents an object (a food reward such as a waxworm or marshmallow for an A trial, or an undesirable object, a wooden bead, for a B trial). For the A trial, the monkey should move its arm to a sensor within 2000 ms, and for the B trial, the monkey should hold its arm on the initial sensor for 2500 ms. If the monkey successfully performs the task, the robot arm moves in the assigned direction, the target LED blinks, and the monkey receives the food reward.
After the monkey is trained to perform the assigned task properly, a micro-electrode array (a 16-channel tungsten microelectrode array, Tucker Davis Technologies, FL) is
surgically implanted under isoflurane anesthesia and sterile conditions. In the closed
loop RLBMI, neural states from the motor cortex (M1) are recorded. These neural states
become inputs to the neural decoder. All surgical and animal care procedures were
consistent with the National Research Council Guide for the Care and Use of Laboratory
Animals and were approved by the University of Miami Institutional Animal Care and Use
Committee. In the closed loop experiment, after the initial holding time that produces
the go beep, the robotic arm’s position is updated based solely on the monkey’s neural
states, and the monkey is not required to perform any movement unlike during the user
pre-training sessions.
During the real-time experiment, 14 neurons are recorded from 10 electrodes. The neural states are represented by the firing rates computed over a 2-second window following the go signal.
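As a rough illustration of how such a state vector could be assembled, the sketch below counts spikes per sorted neuron within the 2-second window after the go signal; the 2-second window and the 14 recorded neurons come from the text, while the data layout, function name, and timestamp units are assumptions.

```python
import numpy as np

def neural_state(spike_times_per_neuron, go_time, window=2.0):
    """Build a firing-rate state vector from spike timestamps (seconds).

    spike_times_per_neuron: list of 1-D arrays, one per sorted neuron (14 in this experiment).
    go_time: time of the go signal for the current trial.
    window: length of the counting window following the go signal (2 s in the text).
    """
    rates = []
    for spikes in spike_times_per_neuron:
        count = np.sum((spikes >= go_time) & (spikes < go_time + window))
        rates.append(count / window)   # spikes per second
    return np.asarray(rates)           # one firing rate per neuron
```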
7.3.2 Agent
For the BMI decoder, we use Q-learning implemented with kernel temporal differences, Q-KTD(λ). The advantage of KTD for online applications is that it does not depend on the initialization, nor does it require any prior information about the input states. Also, this algorithm brings the advantages of both TD learning [50] and kernel methods [44]. Therefore the algorithm is expected to properly predict the neural state to action map, even though the neural states vary in each experiment. Based on the monkey's neural state, the BMI decoder produces an output using the Q-KTD algorithm. The output represents the 2 possible directions (left and right), and the robot arm moves accordingly.
One big difference between open and closed loop applications is the amount of accessible data; in the closed loop experiment, we can only obtain information about the neural states up to the current time. In the previous offline experiment, normalization and kernel selection were conducted off line based on the entire data set. However, it is not possible to apply the same method in the online setting, since we only have information about the input states up to the present time. Normalization is a scaling factor that interacts with the kernel size; proper selection of the kernel size provides proper scaling of the data. The dynamic range of states can change from experiment to
experiment. Consequently, in an online application, the kernel size needs to be adjusted
at each time. Before getting any neural states, the kernel size cannot be determined.
Thus, in contrast to the previous open loop experiments, normalization of the input
neural states is not applied, and the kernel size is automatically selected from the given
inputs.
For Q-KTD(λ), the Gaussian kernel (4–3) is employed. The kernel size h is automatically selected based on the history of inputs. Note that in the closed loop experiments, the dynamic range of states varies from experiment to experiment. Consequently, the kernel size needs to be readjusted each time a new experiment takes place and cannot be determined beforehand. At each time step, the distances between the states are computed to calculate the output values. Therefore, we use these distance values to select the kernel size as follows:
\[
h_{\text{temp}}(n) = \sqrt{\frac{1}{2(n-1)} \sum_{i=1}^{n-1} \left\| x(i) - x(n) \right\|^2} \qquad (7–3)
\]
\[
h(n) = \frac{1}{n} \left[ \sum_{i=1}^{n-1} h(i) + h_{\text{temp}}(n) \right] \qquad (7–4)
\]
Using the squared distances between pairs of previously seen input states, we can
obtain an estimate of the mean distance, and this value is also averaged along with past
kernel sizes to assign the current kernel size.
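A minimal sketch of this selection rule, reading (7–3) and (7–4) literally, is given below; the default value used before any pairwise distances are available is an assumption, since the text does not specify how the very first kernel size is chosen.

```python
import numpy as np

def update_kernel_size(state_history, h_history, x_new, h_default=1.0):
    """Online kernel-size selection following (7-3) and (7-4).

    state_history: list of previously seen state vectors x(1), ..., x(n-1)
    h_history:     list of previously assigned kernel sizes h(1), ..., h(n-1)
    x_new:         current input state x(n)
    """
    n = len(state_history) + 1
    if n == 1:
        h_new = h_default  # no distances available yet (assumed default)
    else:
        sq_dists = [np.sum((np.asarray(x) - np.asarray(x_new)) ** 2) for x in state_history]
        h_temp = np.sqrt(np.sum(sq_dists) / (2.0 * (n - 1)))        # (7-3)
        h_new = (np.sum(h_history) + h_temp) / n                    # (7-4)
    state_history.append(np.asarray(x_new))
    h_history.append(h_new)
    return h_new
```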
The initial error is set to zero, and the first input state vector is assigned as the
first unit’s center. Normalization of the input neural states is not applied, and a stepsize
η = 0.5 is used. Moreover, we consider γ = 1 and λ = 0 since our experiment performs
single step trials in (5–8).
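To summarize how these pieces interact, the following is a minimal sketch of one closed-loop Q-KTD(0) trial under the settings above (two output units, greedy action selection, reward of ±1, η = 0.5, γ = 1, λ = 0). It assumes a kernel-adaptive-filter style update in which each trial adds one unit centered at the current neural state; this is our reading of the update rule, not a verbatim transcription of (5–9), and the reward_fn callback standing in for the experiment outcome is purely illustrative.

```python
import numpy as np

def gaussian_kernel(x, c, h):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(c)) ** 2) / (2.0 * h ** 2))

def predict_q(centers, coeffs, x, h, n_actions=2):
    """Q-value of every action at state x as a kernel expansion over the stored centers."""
    q = np.zeros(n_actions)
    for c, coef in zip(centers, coeffs):      # coef holds one coefficient per action
        q += gaussian_kernel(x, c, h) * coef
    return q

def qktd_trial(centers, coeffs, x, h, reward_fn, eta=0.5, n_actions=2):
    """One closed-loop Q-KTD(0) trial on the single-step, two-target task (sketch)."""
    q = predict_q(centers, coeffs, x, h, n_actions)
    action = int(np.argmax(q))                # greedy action: 0 = left, 1 = right
    reward = reward_fn(action)                # +1 if the robot arm reaches the target, else -1

    # Single-step trial with gamma = 1 and lambda = 0: the TD error reduces to reward - Q(x, a).
    td_error = reward - q[action]

    # Grow the expansion: a new unit centered at x contributes eta * td_error
    # to the output of the selected action only.
    new_coef = np.zeros(n_actions)
    new_coef[action] = eta * td_error
    centers.append(np.asarray(x))
    coeffs.append(new_coef)
    return action, reward, td_error
```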
7.3.3 Results
The overall performance is evaluated by checking whether the robotic arm reaches
the assigned target or not. Once the robot arm reaches the target, the decoder gets a
positive reward +1, otherwise, it receives negative reward −1.
Figure 7-13 shows the decoder performance for 2 experiments; the first experiment
(left column) has a total of 20 trials (10 A trials and 10 B trials). The overall success rate
was 90%. Only the first trial for each target was mis-assigned. The second experiment
(right column) has a total of 53 trials (27 A trials and 26 B trials), with overall success
rate of 41/53 (around 77%). Although the success rate of the second experiment is
Figure 7-13. Performance of Q-learning via KTD in the closed loop RLBMI controlled by a monkey for experiment 1 (left) and experiment 2 (right); the success (+1) and failure (−1) index of each trial (top), the change of TD error (middle), and the change of Q-values (bottom).
not as high as the first experiment, both experiments show that the algorithm learns an
appropriate neural state to action map. Even though there is variation among the neural
states within each experiment, the decoder adapts well to minimize the TD error, and
the Q-values converge to the desired values for each action; since this is a single step
task and the reward +1 is assigned for a successful trial, it is desired that the estimated
Q-value Q̃ be close to +1.
It is observed that the TD error and Q-values oscillate. Drastic changes in the TD error or Q-values correspond to missed trials. The overall performance can be
evaluated by checking whether the robot arm reaches the desired target or not (the top
plots in Figure 7-13). However, this assessment does not show what causes the change
in the system values. In addition, it is hard to know how the two separate intelligent
systems interact during learning and how neural states affect the overall performance.
7.3.4 Closed Loop Performance Analysis
Since this RLBMI architecture contains 2 separate intelligent systems that co-adapt,
it is important to have not only a well performing BMI decoder but also a well trained BMI
user. Under the co-adaptation scenario, it is obvious that if one system does not perform
properly, it will cause detrimental effects on the performance of the other system. If the
BMI decoder does not give proper updates to the robotic device, it will confuse the user
conducting the task, and if the user gives improper state information or the translation is
wrong, the resulting update may fail even though the BMI decoder was able to find the
optimal mapping function.
Here, we analyze how each participant (agent and user) influences the overall
performance both in successful and missed trials by visualizing the states, corresponding
action values Q, and resulting policy in a two-dimensional space. This is the first attempt
to evaluate the individual performance of the subject and the computer agent on
a closed loop Reinforcement Learning Brain Machine Interface (RLBMI). With the
proposed methodology, we can observe how the decoder effectively learns a good state
to action mapping, and how neural states affect the prediction performance. A major
assumption in our methodology is that the user always implements the same strategy to solve the task; otherwise this analysis breaks down. Under this assumption, when the system encounters an unexpected state we attribute it to the user being distracted or uncooperative. This may not always be the case, but we did not have access to enough additional information to quantify behavior beyond visual inspection.
In the two-target reaching task, the decoder contains two output units representing
the functions Q(x, a = left) and Q(x, a = right). The policy is determined by selecting the action whose unit has the larger Q-value. The
performance of the decoder is commonly evaluated in terms of success rate by counting
the successful trials that reach the desired targets, along with the changes in the TD
error or the Q-values. However, these criteria are not well suited to understand how
the two intelligent systems interact during learning. For instance, if there is a change in
performance or an error in the decoding process it is hard to tell which one of the two
subsystems is more likely to be responsible for it.
Another difficulty in evaluating the user's output is that the neural states are high dimensional vectors. Therefore, we want to apply a dimensionality reduction technique that produces a representation of the user's output that can be visualized and easily interpreted, while remaining independent of the class labels (unsupervised). We found that
principal component analysis (PCA) on the set of observed neural states is sufficient
for the goal of this analysis. PCA is a well known method to transform data to a new
coordinate system based on the eigenvalue decomposition of the data covariance matrix. Let X = [x(1), x(2), · · · , x(n)]⊤ be the data matrix containing the set of observed states during the closed loop experiment up to time n. A transformed data set Y = XW can be obtained using the transformation matrix W, which corresponds to the matrix of eigenvectors of the covariance matrix n⁻¹X⊤X. Without loss of generality, we assume that the data X has zero mean. The distribution of states up to time n can be visualized by projecting the high dimensional neural states into two dimensions using the two largest principal components.
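A minimal sketch of this projection, assuming the observed neural states are stacked as the rows of X:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project observed neural states onto their leading principal components.

    X: (n, d) matrix of the neural states observed up to time n.
    Returns the projected states Y, the transformation matrix W, and the removed mean.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                                  # zero-mean data, as assumed in the text
    cov = (Xc.T @ Xc) / Xc.shape[0]                # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigendecomposition (ascending eigenvalues)
    W = eigvecs[:, ::-1][:, :n_components]         # two largest principal components
    Y = Xc @ W                                     # projected states
    return Y, W, mean
```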
In this two-dimensional space of projected neural states, we can also show the
relation with the decoder by computing the outputs of the units associated with each
one of the actions and displaying them as contour plots. A set of two-dimensional space
locations Ygrid evenly distributed on the plane can be projected back into the high dimensional space of neural states as X̂ = Ygrid W⊤. Let Q_i^(n) be the i-th output unit of the decoder updated using (5–9) at time n. We can compute the estimated Q-values at a point y on the two-dimensional plane as Q̂^(n)(x̂ = Wy, a = i). In this way, we can extrapolate the
possible outputs that the decoder would produce in the vicinity of the already observed
data points. Furthermore, the final estimated policy can be obtained by selecting the unit
that matches the action with the maximum Q-value among all output units (Figure 7-14).
Figure 7-14. Proposed visualization method.
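Continuing the previous sketch, the grid evaluation and the resulting policy partition could be computed as follows; the q_value callback standing in for the learned decoder Q̂^(n), as well as the grid limits, are assumptions for illustration.

```python
import numpy as np

def policy_map(W, mean, q_value, y1_range=(-50, 50), y2_range=(-120, 20), n_grid=60, n_actions=2):
    """Evaluate the decoder on a grid of points in the projected two-dimensional space.

    W:       (d, 2) matrix of the two leading principal components (from pca_project).
    mean:    mean neural state removed before projection.
    q_value: callable q_value(x, a) returning the decoder's estimate of Q(x, a) (assumed interface).
    Returns the Q-value surface of every action and the greedy policy over the grid.
    """
    y1 = np.linspace(*y1_range, n_grid)
    y2 = np.linspace(*y2_range, n_grid)
    Q = np.zeros((n_actions, n_grid, n_grid))
    for i, a1 in enumerate(y1):
        for j, a2 in enumerate(y2):
            x_hat = mean + W @ np.array([a1, a2])   # back-projection of the grid point
            for a in range(n_actions):
                Q[a, j, i] = q_value(x_hat, a)
    policy = np.argmax(Q, axis=0)                    # greedy action at every grid location
    return Q, policy
```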
Here, we visualize the neural states and corresponding Q-values and policy π
related to the final performance. Thus, the final learned decoder Q̂(T ) and all the neural
states X are utilized; that is, n = T and X is of size T × d where d is the dimension
of the neural state vectors. Notice that the proposed method can also be applied at
any stage of the learning process; we can observe the behavior of the two systems at any intermediate time by using the subset of neural states observed up to that time together with the decoder learned up to that time.
Figure 7-15 provides a visualization of the distribution of the 14 dimensional
neural states projected into two dimensions. The corresponding contour levels are the
estimated action values Q̃ using the learned decoder from the closed loop experiment.
In addition, we provide the partition for left and right actions in the projected two
dimensional space, which corresponds to the final policy derived from the estimated
Q-values. The projection shows that the neural states from the two classes are
separable. As we expected, the Q-values for each direction have higher values on
regions occupied by the corresponding neural states. For example, the Q-values for the
[Figure: contour plots of the estimated Q-values and the resulting policy partition in the space of the first two principal components, with A-trial and B-trial neural states marked, for experiments 1 and 2.]
Figure 7-15. The estimated Q-values (top) and resulting policy (bottom) for the projected neural states using PCA from experiment 1 (left column) and experiment 2 (right column). The first and third top plots show the Q-values for the “right” direction, and the second and fourth top plots show the Q-values for the “left” direction.
right direction have larger values for the areas filled by the states corresponding to B
trial. This is confirmed by showing the partitions achieved by the resulting policy.
During the training session, the success rates were highly dependent on the
monkey’s performance. Most of the times when the agent predicted the wrong target,
it was observed that the monkey was distracted, or it was not interacting with the task
properly. We are also able to see this phenomenon from the plots; the failed trials during
the closed loop experiment are marked as red stars (missed A trials) and green dots
(missed B trials). We can see that most of the neural states that were misclassified
appear to be closer to the states corresponding to the opposite target in the projected
state space. This supports the idea that failure during these trials was mainly due to the
monkey’s behavior and not to the decoder.
From the bottom plots, it is apparent that the decoder can predict nonlinear policies.
Finally, the estimated policy in experiment 2 (bottom right plot) shows that the system
effectively learns and goes from an initially misclassified A trial (during the closed loop
experiment), which is located near the border and right bottom areas, to a final decoder
where the same state would be assigned to the correct direction. It is remarkable that the system adapts to the environment on-line.
We have applied the Q-KTD(λ) and Q-CKTD algorithms to neural decoding for brain
machine interfaces. In the open loop RLBMI experiment, we confirmed that the system
was able to find a proper neural state to action mapping. In addition, we saw how
by using correntropy as a cost function there could be potential improvements to the
learning speed. Q-KTD(λ) was then successfully applied to closed loop experiments, where the decoder was able to provide the proper robot arm actions. Finally, we explored a first attempt at analyzing the behavior of the two intelligent systems separately. With
the proposed methodology, we observed how the neural state influences the decoder
performance.
CHAPTER 8
CONCLUSIONS AND FUTURE WORK
The reinforcement learning brain machine interface (RLBMI) [13] has been shown
to be a promising paradigm for BMI implementations. It allows co-adaptive learning
between two intelligent systems; one is the BMI decoder on the agent side, and the
other is the BMI user as part of the environment. From the agent side, the proper neural
decoding of the motor signals is essential to control the external device that interacts
with the physical environment. However, there are several challenges that must be
addressed in order to turn RLBMI into a practical reality. First, algorithms must be able
to readily handle the high dimensional state spaces that correspond to the neural state representation. The mapping from neural states to actions must be flexible enough to handle nonlinear relations while making few assumptions. Algorithms should require a reasonable amount of computational resources to allow real time implementation. The algorithms should also handle cases where assumptions may not hold, e.g., the presence of outliers or perturbations in the environment. In this thesis, we have introduced
algorithms that take into account the above mentioned issues. We have employed
synthetic experiments that illustrate the properties of the proposed methods and
encourage their applicability in practical scenarios. Finally, we applied these algorithms
to RLBMI experiments, showing their potential advantages in a relevant application.
We started by introducing three new temporal difference (TD) algorithms for state
value function estimation. State value function estimation is an intermediate step to
find a proper mapping from state to action, from which all fundamental features of the
algorithms could be observed. This functional approximation is able to handle large
amounts of input data which is often required in practical implementations. We have
seen how the proposed TD algorithms can provide functional approximation of state
value functions given a policy.
Kernel temporal difference (KTD)(λ) was proposed by integrating kernel-based representations into conventional TD learning. The big advantages of this kernel-based
learning algorithm are the nonlinear functional approximation capabilities along with
the known convergence guarantees of linear TD learning, which results in more
accurate and faster learning. Using the dual representations, it can be shown that
the convergence results for linear TD(λ) extend to the kernel-based algorithm. By using
strictly positive definite kernels, the linear independence condition is automatically
satisfied for input state representations in absorbing Markov processes. Experiments on
simulated data drawn from absorbing Markov chains allowed us to confirm the method’s
nonlinear approximation capabilities.
Moreover, robust variants of TD(λ) and KTD(λ) algorithms were proposed by using
correntropy as a cost function. Namely, correntropy temporal difference (CTD) and
correntropy kernel temporal difference (CKTD) were derived for the case of λ = 0.
TD(λ) and KTD(λ) use mean square error (MSE) as their objective function which has
known limitations in environments corrupted by non-Gaussian noise. Experiments using a synthetic absorbing Markov chain showed that CTD and CKTD are able to provide better robustness than their MSE-based counterparts under non-Gaussian noise or perturbed state transitions.
We have observed that KTD(λ) has better performance with λ = 0 than larger λ due
to the relation between stepsize η and the eligibility trace rate λ. In multistep prediction
problems, when the Gaussian kernel is employed in the system, larger eligibility trace
rates require smaller stepsizes for stable performance, which also depends on the
allowed number of steps per trial. Small stepsizes for large λ make the performance
slower compared to the larger learning rates that small λ values allow. Thus, it is intuitive that CTD and CKTD with larger λ may not perform as well as with λ = 0 in on-line implementations. However, it is necessary to further explore the behavior
of CTD and CKTD with general λ. The extension of TD(0) and KTD(0) to general λ
using the multi-step prediction as a starting point does not seem to be applicable for
correntropy, since there is no obvious way to interchange terms in the cost due to the
nonlinearity of the kernel employed by correntropy; no updates can be made before
a trial is complete. The update rule we derived for general λ requires us to update
the system once a trial is complete. Therefore, further study for the derivation of CTD
and CKTD for general λ is required. In addition, we observed that CTD and CKTD
have stable performance. However, further analysis is still required to determine the
convergence points.
We extended all proposed algorithms to state-action value functions based on
Q-learning. This extension allows us to find a proper state to action mapping which
can be further exploited in practical cases such as the neural decoding problem in
reinforcement learning brain machine interfaces. The introduced TD algorithms were
extended to estimate action-value functions, and based on the estimated values, the
optimal policy can be decided using Q-learning. Three variants of Q-learning were
derived: Q-learning via correntropy temporal difference (Q-CTD), Q-KTD(λ), and
Q-CKTD.
The observation and analysis of CTD, KTD(λ), and CKTD gives us a basic idea
of how the proposed extended algorithms behave. However, in the case of Q-CTD,
Q-KTD(λ), and Q-CKTD, the convergence analysis is still challenging since Q-learning
contains both a learning policy and a greedy policy. In the case of Q-KTD(λ), the
convergence proof for Q-learning using temporal difference (TD)(λ) with linear function
approximation in [32] gives a basic intuition for the role of function approximation on
the convergence of Q-learning. For the kernel-based representation in Q-KTD(λ), the
direct extension of the results from [32] would bring the advantages of nonlinear function
approximation. Nonetheless, applying these results requires an extended version of the ordinary differential equation (ODE) method for Hilbert space valued differential equations.
The extended algorithms were applied to find an optimal control policy in decision
making problems where the state space is continuous. We observed the behavior of
Q-KTD and Q-CKTD under various parameter sets including kernel size, stepsize, and
eligibility trace rate. From the experiments, we observed that the optimal filter kernel size
depends on the input distribution and affects the learning speed, and proper annealing
of the stepsize is required for convergence. For KTD small eligibility traces tend to work
better. In the case of correntropy, the kernel size presents a trade off between learning
speed and robustness and also depends on the error distribution. Results showed that
Q-KTD(λ) can offer performance advantages over other conventional nonlinear function
approximation methods. Furthermore, it is important to highlight how the robustness
property of the correntropy criterion can be exploited to improve learning under changing
policies. We have empirically observed that Q-CKTD was able to provide a better policy
in the off-policy learning paradigm.
Furthermore, Q-KTD(λ) was applied to estimate an optimal policy in open loop
brain machine interface (BMI) problems, and experimental results show the method can
effectively learn the brain-state action mapping. We also tested Q-CKTD on an open
loop RLBMI application to assess the algorithm’s capability in estimating a proper state
to action map. In off-policy TD learning, Q-CKTD results showed that the optimal policy
could be estimated even without having perfect predictions of the value function in a
problem involving a discrete set of actions.
Finally, we applied Q-KTD to closed loop RLBMI experiments using a monkey.
Results showed that the algorithm succeeds in finding a proper mapping between neural
states and desired actions. Therefore, the kernel filter structure is a suitable approach
to obtain a flexible neural state decoder that can be learned and adapted online. We
also provided a methodology to tease apart the influences of the user and the agent in
the overall performance of the system. This methodology helped us visualize the cases
where the errors may have been caused by the user as well as the decision boundaries
that the decoder implements based on the observed neural states.
We saw the successful integration of the proposed TD algorithms in policy search.
This shows that the introduced TD methods have the capability to approximate value
functions properly, which can contribute to finding a proper policy. Actor-Critic is another
well known method to find a policy using an estimated value function. We can also
extend the application of the Q-CTD, Q-KTD, and Q-CKTD algorithms to the Actor-Critic
framework. The Actor-Critic method combines the advantages of policy gradient and
value function approximation with the possibility of better convergence guarantees and
reduced variance on the estimation. The TD algorithms can be applied to the Critic to
estimate the value function, and the policy gradient method can be applied to update the
Actor that chooses the action [23].
APPENDIX A
MERCER’S THEOREM
Let X be a compact subset of R^n. Suppose κ is a continuous symmetric function such that the integral operator T_κ : L2(X) → L2(X),
\[
(T_\kappa f)(\cdot) = \int_{X} \kappa(\cdot, x) f(x)\, dx,
\]
is positive, that is,
\[
\int_{X \times X} \kappa(x, z) f(x) f(z)\, dx\, dz \geq 0
\]
for all f ∈ L2(X). Then we can expand κ(x, z) in a uniformly convergent series (on X × X) in terms of functions ϕ_j satisfying ⟨ϕ_j, ϕ_i⟩_{L2(X)} = δ_ij:
\[
\kappa(x, z) = \sum_{j=1}^{\infty} \lambda_j \phi_j(x) \phi_j(z).
\]
Furthermore, the series ∑_{i=1}^{∞} |λ_i| is convergent [33].
APPENDIX B
QUANTIZATION METHOD
The quantization approach introduced in [9] is a simple yet effective approximation
heuristic that limits the growing structure of the filter by adding units in a selective
fashion. Once a new state input x(i) arrives, its distances to each existing unit C(i − 1)
are calculated
\[
\mathrm{dist}(x(i), C(i-1)) = \min_{1 \leq j \leq \mathrm{size}(C(i-1))} \left\| x(i) - C_j(i-1) \right\|. \qquad (B–1)
\]
If the minimum distance dist(x(i),C(i − 1)) is smaller than the quantization size ϵU , the
new input state x(i) is absorbed by the closest existing unit to it, and hence no new unit
is added to the structure. In this case, unit centers remain the same C(i) = C(i − 1), but
the connection weights to the closest unit are updated.
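A small sketch of this check, assuming the existing units are stored as a plain list of center vectors:

```python
import numpy as np

def quantize(centers, x_new, eps_u):
    """Quantization step from [9]: absorb x_new into the closest existing unit when the
    minimum distance (B-1) is below eps_u; otherwise add x_new as a new center.
    Returns the index of the unit the new state maps to (the weight update itself
    is handled by the learning rule)."""
    if centers:
        dists = [np.linalg.norm(np.asarray(x_new) - np.asarray(c)) for c in centers]
        j = int(np.argmin(dists))
        if dists[j] < eps_u:
            return j                       # absorbed: the set of centers is unchanged
    centers.append(np.asarray(x_new))      # new unit added to the structure
    return len(centers) - 1
```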
REFERENCES
[1] Bae, Jihye, Chhatbar, Pratik, Francis, Joseph T., Sanchez, Justin C., and Principe, Jose C. “Reinforcement Learning via Kernel Temporal Difference.” The 33rd Annual International Conference of the IEEE on Engineering in Medicine and Biology Society. 2011, 5662–5665.
[2] Bae, Jihye, Giraldo, Luis Sanchez, Chhatbar, Pratik, Francis, Joseph T., Sanchez, Justin C., and Principe, Jose C. “Stochastic Kernel Temporal Difference for Reinforcement Learning.” IEEE International Workshop on Machine Learning for Signal Processing. 2011, 1–6.
[3] Baird, Leemon. “Residual Algorithms: Reinforcement Learning with Function Approximation.” Machine Learning. 1995, 30–37.
[4] Boser, Bernhard E., Guyon, Isabelle M., and Vapnik, Vladimir N. “A Training Algorithm for Optimal Margin Classifiers.” In Proceedings of the 5th Annual Workshop on Computational Learning Theory (COLT). 1992, 144–152.
[5] Boyan, Justin A. Learning Evaluation Functions for Global Optimization. Ph.D. thesis, Carnegie Mellon University, 1998.
[6] ———. “Technical Update: Least-Squares Temporal Difference Learning.” Machine Learning 49 (2002): 233–246.
[7] Boyan, Justin A. and Moore, Andrew W. “Generalization in Reinforcement Learning: Safely Approximating the Value Function.” Advances in Neural Information Processing Systems. 1995, 369–376.
[8] Bradtke, Steven J. and Barto, Andrew G. “Linear Least-Squares Algorithms for Temporal Difference Learning.” Machine Learning 22 (1996): 33–57.
[9] Chen, Badong, Zhao, Songlin, Zhu, Pingping, and Principe, Jose C. “Quantized Kernel Least Mean Square Algorithm.” IEEE Transactions on Neural Networks and Learning Systems 23 (2012).1: 22–32.
[10] Dayan, Peter and Sejnowski, Terrence J. “TD(λ) Converges with Probability 1.” Machine Learning 14 (1994): 295–301.
[11] Deisenroth, Marc Peter. Efficient Reinforcement Learning using Gaussian Processes. Ph.D. thesis, Karlsruhe Institute of Technology, 2010.
[12] Dietterich, Thomas G. and Wang, Xin. “Batch Value Function Approximation via Support Vectors.” Advances in Neural Information Processing Systems. MIT Press, 2001, 1491–1498.
[13] DiGiovanna, Jack, Mahmoudi, Babak, Fortes, Jose, Principe, Jose C., and Sanchez, Justin C. “Coadaptive Brain-Machine Interface via Reinforcement Learning.” IEEE Transactions on Biomedical Engineering 56 (2009).1.
[14] Engel, Yaakov, Mannor, Shie, and Meir, Ron. “Reinforcement learning with Gaussian processes.” In Proceedings of the 22nd International Conference on Machine Learning. 2005, 201–208.
[15] Geramifard, Alborz, Bowling, Michael, and Sutton, Richard S. “Incremental Least-Squares Temporal Difference Learning.” In Proceedings of the 21st National Conference on Artificial Intelligence. 2006, 356–361.
[16] Geramifard, Alborz, Bowling, Michael, Zinkevich, Martin, and Sutton, Richard S. “iLSTD: Eligibility Traces and Convergence Analysis.” Advances in Neural Information Processing Systems. 2007, 441–448.
[17] Ghavamzadeh, Mohammad and Engel, Yaakov. “Bayesian Actor-Critic Algorithms.” In Proceedings of the 24th International Conference on Machine Learning. 2007.
[18] Gunduz, Aysegul and Principe, Jose C. “Correntropy as a Novel Measure for Nonlinearity Tests.” International Joint Conference on Neural Networks 89 (2009).
[19] Haykin, Simon. Neural Networks: A Comprehensive Foundation. Maxwell, 1994.
[20] ———. Neural Networks and Learning Machines. Prentice Hall, 2009.
[21] Jeong, Kyu-Hwa and Principe, Jose C. “The Correntropy MACE Filter for Image Recognition.” In Proceedings of the 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing. 2006, 9–14.
[22] Kim, Sung-Phil, Sanchez, Justin C., Rao, Yadunandana N., Erdogmus, Deniz, Carmena, Jose M., Lebedev, Mikhail A., Nicolelis, Miguel A. L., and Principe, Jose C. “A Comparison of Optimal MIMO Linear and Nonlinear Models for Brain-Machine Interfaces.” Journal of Neural Engineering 3 (2006).145.
[23] Konda, Vijay R. and Tsitsiklis, John N. “On Actor-Critic Algorithms.” Society for Industrial and Applied Mathematics Journal on Control and Optimization 42 (2003).4: 1143–1166.
[24] Kushner, Harold J. and Clark, Dean S. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978.
[25] Liu, Weifeng, Park, Il, and Principe, Jose C. “An Information Theoretic Approach of Designing Sparse Kernel Adaptive Filters.” IEEE Transactions on Neural Networks 20 (2009).12: 1950–1961.
[26] Liu, Weifeng, Pokharel, Puskal P., and Principe, Jose C. “Correntropy: Properties and Applications in Non-Gaussian Signal Processing.” IEEE Transactions on Signal Processing 55 (2007).11: 5286–5298.
[27] ———. “The Kernel Least Mean Square Algorithm.” IEEE Transactions on Signal Processing 56 (2008).2: 543–554.
[28] Liu, Weifeng, Principe, Jose C., and Haykin, Simon. Kernel Adaptive Filtering: A Comprehensive Introduction. Wiley, 2010.
[29] Maei, Hamid Reza, Szepesvari, Csaba, Bhatnagar, Shalabh, and Sutton, Richard S. “Toward Off-Policy Learning Control with Function Approximation.” Proceedings of the 27th International Conference on Machine Learning. 2010.
[30] Mahmoudi, Babak. Integrating Robotic Action with Biologic Perception: A Brain Machine Symbiosis Theory. Ph.D. thesis, University of Florida, 2010.
[31] Mahmoudi, Babak, DiGiovanna, Jack, Principe, Jose C., and Sanchez, Justin C. “Co-Adaptive Learning in Brain-Machine Interfaces.” Brain Inspired Cognitive Systems (BICS). 2008.
[32] Melo, Francisco S., Meyn, Sean P., and Ribeiro, M. Isabel. “An Analysis of Reinforcement Learning with Function Approximation.” In Proceedings of the 25th International Conference on Machine Learning. 2008, 664–671.
[33] Mercer, John. “Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations.” Philosophical Transactions of the Royal Society of London 209 (1909): 415–446.
[34] Moore, Andrew W. “Variable Resolution Dynamic Programming: Efficiently Learning Action Maps in Multivariate Real-valued State-spaces.” In Proceedings of the 8th International Conference on Machine Learning. 1991.
[35] Mulliken, Grant H., Musallam, Sam, and Andersen, Richard A. “Decoding Trajectories from Posterior Parietal Cortex Ensembles.” The Journal of Neuroscience 28 (2008).48: 12913–12926.
[36] Park, Il and Principe, Jose C. “Correntropy Based Granger Causality.” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2008, 3605–3608.
[37] Pohlmeyer, Eric A., Mahmoudi, Babak, Geng, Shijia, Prins, Noe, and Sanchez, Justin C. “Brain-machine interface control of a robot arm using actor-critic reinforcement learning.” Annual International Conference of the IEEE on Engineering in Medicine and Biology Society (EMBC). 2012, 4108–4111.
[38] Principe, Jose C. Information Theoretic Learning. Springer, 2010.
[39] Rasmussen, Carl Edward and Kuss, Malte. “Gaussian Processes in Reinforcement Learning.” Advances in Neural Information Processing Systems. MIT Press, 2004, 751–759.
[40] Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
[41] Sanchez, Justin C., Tarigoppula, Aditya, Choi, John S., Marsh, Brandi T., Chhatbar, Pratik Y., Mahmoudi, Babak, and Francis, Joseph T. “Control of a center-out reaching task using a reinforcement learning Brain-Machine Interface.” The 5th International IEEE/EMBS Conference on Neural Engineering (NER). 2011, 525–528.
[42] Santamaria, Ignacio, Pokharel, Puskal P., and Principe, Jose C. “Generalized Correlation Function: Definition, Properties, and Application to Blind Equalization.” IEEE Transactions on Signal Processing 54 (2006).6.
[43] Saunders, Craig, Gammerman, Alexander, and Vovk, Volodya. “Ridge Regression Learning Algorithm in Dual Variables.” In Proceedings of the 15th International Conference on Machine Learning. 1998, 515–521.
[44] Scholkopf, Bernhard and Smola, Alexander J. Learning with Kernels. MIT Press, 2002.
[45] Singh, Abhishek and Principe, Jose C. “Using Correntropy as a cost function in linear adaptive filters.” The 2009 International Joint Conference on Neural Networks (IJCNN). 2009, 2950–2955.
[46] ———. “A Closed Form Recursive Solution for Maximum Correntropy Training.” 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2010, 2070–2073.
[47] ———. “A loss function for classification based on a robust similarity metric.” The 2010 International Joint Conference on Neural Networks (IJCNN). 2010, 1–6.
[48] Singh, Satinder P. and Sutton, Richard S. “Reinforcement Learning with Replacing Eligibility Traces.” Machine Learning 22 (1996): 123–158.
[49] Sussillo, David, Nuyujukian, Paul, Fan, Joline M., Kao, Jonathan C., Stavisky, Sergey D., Ryu, Stephen, and Shenoy, Krishna. “A recurrent neural network for closed-loop intracortical brain-machine interface decoders.” Journal of Neural Engineering 9 (2012).2.
[50] Sutton, Richard S. “Learning to Predict by the Methods of Temporal Differences.” Machine Learning 3 (1988): 9–44.
[51] ———. “Open Theoretical Questions in Reinforcement Learning.” Tech. rep., AT&T Labs, 1999.
[52] Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, 1998.
[53] Szepesvari, Csaba. Algorithms for Reinforcement Learning. Morgan & Claypool, 2010.
[54] Tsitsiklis, John N. and Van Roy, Benjamin. “An Analysis of Temporal-Difference Learning with Function Approximation.” IEEE Transactions on Automatic Control 42 (1997).5: 674–690.
[55] Watkins, Christopher J. C. H. Learning from Delayed Rewards. Ph.D. thesis, King’s College, 1989.
[56] Watkins, Christopher J. C. H. and Dayan, Peter. “Technical Note: Q-Learning.” Machine Learning 8 (1992).3-4: 279–292.
[57] Xu, Xin, Hu, Dewen, and Lu, Xicheng. “Kernel-Based Least Squares Policy Iteration for Reinforcement Learning.” IEEE Transactions on Neural Networks 18 (2007).4.
[58] Xu, Xin, Xie, Tao, Hu, Dewen, and Lu, Xicheng. “Kernel Least-Squares Temporal Difference Learning.” International Journal of Information Technology. vol. 11. 2005, 54–63.
[59] Zhao, Songlin, Chen, Badong, and Principe, Jose C. “Kernel Adaptive Filtering with Maximum Correntropy Criterion.” The 2011 International Joint Conference on Neural Networks (IJCNN). 2011, 2012–2017.
BIOGRAPHICAL SKETCH
Jihye Bae received a Bachelor of Engineering in the School of Electrical Engineering
and Computer Science at Kyungpook National University, Daegu, South Korea in 2007,
and the Master of Science and Doctor of Philosophy (Ph.D.) in the Department of
Electrical and Computer Engineering at University of Florida, Gainesville, Florida, the
United States of America in 2009 and 2013, respectively. She joined the Computational
Neuro-Engineering Laboratory (CNEL) at University of Florida in 2010 during her
Ph.D. studies and worked as a research assistant under the supervision of Prof. Jose
C. Principe at CNEL. Her research interests encompass adaptive signal processing,
machine learning, and their applications in brain machine interfaces including neural
decoding and control problems. Her current research mainly focuses on kernel methods
and information theoretic learning, and how both areas can be applied in reinforcement
learning.