FAMILIES OF ESTIMATOR-BASED STOCHASTIC
LEARNING ALGORITHMS
by
MARIANA AGACHE, BSc.
A thesis submitted to the
Faculty of Graduate Studies and Research
in partial fulfilment of the requirements for the degree of
Master of Computer Science
Ottawa-Carleton Institute for Computer Science
School of Computer Science
Carleton University
Ottawa, Ontario
January 2000
© Copyright 2000, Mariana Agache
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
ABSTRACT

This thesis studies the field of Estimator-based Learning Automata. We first argue that the Reward-Penalty and Reward-Inaction learning paradigms in conjunction with the continuous and discrete models of computation lead to four versions of Pursuit Learning Automata. Such schemes permit the Learning Automaton to utilize the long-term and short-term perspectives of the environment. We present all four resultant Pursuit algorithms, and a quantitative comparison of their performance.

The existing Pursuit algorithms 'pursue' the action that is currently estimated as the best action. In this thesis, we claim that by pursuing all the actions with higher estimates than the chosen action, the performance of the Pursuit algorithm improves considerably. To attest this, we introduce two new Pursuit algorithms, and justify their superiority through extensive simulations.

Thathachar and Sastry introduced the TSE estimator algorithm and presented its updating equations in a scalar form. In this thesis, we present a vectorial representation of the TSE algorithm, and propose its generalized vectorial version, whose superiority is also proven experimentally.
ACKNOWLEDGEMENTS

First and foremost, I would like to thank my Professor, Dr. John Oommen, for his continuous support, encouragement, and for believing in me. Throughout these years, his guidance, ideas, and enthusiasm gave me the opportunity to explore new scientific domains and to discover new research horizons, for which I am deeply grateful.

I would also like to thank Professor M. A. L. Thathachar, of the Indian Institute of Science, Bangalore, India, for his valuable comments and suggestions related to the work presented in Chapter 3 of this thesis.

I would like to thank my parents, my sister and Wellington for their help, support, and encouragement.
TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION ... 1
1.1 General description of the learning automata ... 1
1.2 Justification and objectives of the Thesis ... 6
1.3 Contributions of the Thesis ... 7
1.4 Content and organization of the Thesis ... 9
CHAPTER 2: LEARNING AUTOMATA - AN OVERVIEW ... 11
2.1 Definition of Automaton ... 11
2.1.1 Deterministic Automaton ... 12
2.1.2 Stochastic Automaton ... 13
2.2 The Learning Automaton ... 15
2.2.1 The Environment ... 16
2.2.2 Definition of the Learning Automaton ... 17
2.2.3 Norms of behavior ... 18
2.3 Fixed Structure and Variable Structure Learning Automata ... 21
2.3.1 Fixed Structure Automata ... 21
2.3.1.1 Tsetlin Automaton ... 21
2.3.1.2 Krinsky Automaton ... 24
2.3.1.3 Krylov Automaton ... 26
2.3.2 Variable Structure Stochastic Automata ... 27
2.3.2.1 Linear Reward-Penalty Scheme (LRP) ... 31
2.3.2.2 Linear Reward-Inaction Scheme (LRI) ... 32
2.3.2.3 Linear Inaction-Penalty Scheme (LIP) ... 33
2.3.3 Discretized Learning Automata ...
2.3.3.1 Discretized Linear Reward-Inaction Automaton ... 35
2.3.3.2 Discretized Linear Inaction-Penalty Automaton ... 38
2.3.3.3 Discretized Linear Reward-Penalty Automaton ... 39
2.4 Estimator Algorithms ...
2.4.1 Overview ...
2.4.2 Continuous Estimator Algorithms ... 45
2.4.2.1 Pursuit Algorithm ... 45
2.4.2.2 TSE Algorithm ... 47
2.4.3 Discrete Estimator Algorithms ... 51
2.4.3.1 Discrete Pursuit Algorithm ... 52
2.4.3.2 Discrete TSE Algorithm ... 54
2.5 Conclusions ... 58
CHAPTER 3: NEW PURSUIT ALGORITHMS ... 60
3.1 Introduction ... 60
3.2 Continuous Reward-Inaction Pursuit Algorithm (CPRI) ... 62
3.3 Discretized Reward-Penalty Pursuit Algorithm (DPRP) ... 66
3.4 Simulation Results ... 69
3.5 Conclusions ... 75
CHAPTER 4: GENERALIZATION OF THE PURSUIT ALGORITHM ... 77
4.1 Introduction ... 77
4.2 Generalized Pursuit Algorithm ... 80
4.3 Discretized Generalized Pursuit Algorithm ... 86
4.4 Simulation Results ...
4.5 Conclusions ... 93
CHAPTER 5: GENERALIZATION OF THE TSE ALGORITHM ... 94
5.1 Vectorial representation of the TSE algorithm ... 94
5.2 Generalization of the TSE algorithm ... 98
5.3 Simulation Results ... 105
5.4 Conclusions ... 110
CHAPTER 6: CONCLUSIONS ... 111
6.1 Summary ... 111
6.2 Future work ... 115
LIST OF TABLES

Table 2.1: Experimental comparative performance of DLRI with other FSSA [14] ... 38
Table 2.2: Comparative performance of DLRI, ADLIP, ADLRP (c2=0.8) ... 41
Table 2.3: Comparison between continuous and discrete linear VSSA ... 42
Table 2.4: The number of iterations until convergence in two-action environments for the TSE Algorithm [10] ... 56
Table 2.5: Comparison of the discrete and continuous estimator algorithms in benchmark ten-action environments [10] ... 57
Table 3.1: Comparison of the Pursuit algorithms in two-action benchmark environments for which exact convergence was required in 750 experiments (NE=750) ... 71
Table 3.2: Comparison of the Pursuit algorithms in two-action benchmark environments for which exact convergence was required in 500 experiments (NE=500) ... 71
Table 3.3: Comparison of the Pursuit algorithms in new two-action environments for which exact convergence was required in 750 experiments (NE=750) ... 73
Table 3.4: Comparison of the Pursuit algorithms in ten-action benchmark environments for which exact convergence was required in 750 experiments (NE=750) ... 74
Table 3.5: Comparison of the Pursuit algorithms in ten-action benchmark environments for which exact convergence was required in 500 experiments (NE=500) ... 74
Table 4.1: Performance of the generalized Pursuit algorithms in benchmark ten-action environments for which exact convergence was required in 750 experiments (NE=750) ... 91
Table 5.1: Performance of the GTSE and TSE algorithms in ten-action benchmark environments for which exact convergence was required in 750 experiments (NE=750) ... 106
Table 5.2: Performance of the GTSE and DTSE algorithms in benchmark ten-action environments for which exact convergence was required in 750 experiments (NE=750) ... 109
LIST OF FIGURES

Figure 2.1: The automaton ... 11
Figure 2.2: The environment ... 16
Figure 2.3: Feedback connection of automaton and environment ... 17
Figure 2.4: State transition graphs for the Tsetlin automaton L2N,2 ... 22
Figure 2.5: State transition graphs for the Krinsky automaton ... 25
Figure 2.6: State transition graphs for the Krylov automaton ... 26
Figure 3.1: Comparison of the Pursuit algorithms in two-action benchmark environments for which exact convergence was required in 750 experiments (NE=750) ... 72
Figure 4.1: Solution approach of the CPRP Pursuit Algorithm and the Generalized Pursuit algorithms ... 78
Figure 4.2: Performance of the Pursuit Algorithm relative to the CPRP algorithm in ten-action environments for which exact convergence was required in 750 experiments ... 92
Figure 5.1: Performance of the continuous Estimator Algorithms relative to the CPRP algorithm in ten-action environments for which exact convergence was required in 750 experiments ... 107
Figure 6.1: Performance of some Estimator Algorithms relative to the CPRP algorithm in ten-action environments for which exact convergence was required in 750 experiments ... 114
Chapter 1: INTRODUCTION
1.1 General description of the learning automata
Learning represents one of the most important psychological processes and it is essential in the behavior of self-adjusting organisms. In psychology, it can be defined as an organism's ability to modify its current behavior based on past behavior and the consequences of its prior choices. The psychological and biological concepts of learning have been widely studied and have been incorporated in many engineering systems that deal with incomplete information and uncertainty. Applications such as Adaptive Control Systems, Pattern Recognition, Game Theory, and Object Partitioning needed to incorporate learning characteristics in their functionality in order to model systems with a substantial amount of uncertainty. Many of these systems are required to choose the correct actions (decisions) without a priori knowledge of the consequences of performing these actions. To perform well under these conditions of uncertainty, the systems needed to acquire some knowledge about the consequences of performing the various actions. This acquisition and utilization of relevant knowledge in order to improve the performance of a system is posed as the learning problem. The goal of the learning problem is to compensate for the insufficient information by appropriate data collection and processing, while moving the process towards its solution. In these engineering systems, learning has been implemented using various methods and techniques, namely: stochastic approximation methods [3], heuristic programming techniques [23], inductive inferential techniques [25], and statistical inferential techniques [4][5].
Another approach to solving the learning problem was initiated by the Russian mathematician Tsetlin. He introduced, in 1961, a new model of computer learning that is now called a learning automaton, and which will be the focus of the study presented in this thesis. The goal of such an automaton is to determine the optimal action out of a set of allowable actions. The functionality of the learning automaton can be described in terms of a sequence of repetitive feedback cycles in which the automaton interacts with an Environment. The automaton chooses an action that triggers a response from the Environment. Such a response can be either a reward or a penalty. The automaton uses this response and the knowledge acquired in the past actions to determine the next action. The term Environment refers, in general, to the collection of all external conditions and influences affecting the life and development of an organism or system. In the context of learning automata, the term defines a random unknown 'medium' in which an automaton or a group of automata can operate.
Learning automata can operate individually or can be interconnected in a hierarchical or distributed fashion [30]. Some of the applications using learning automata that operate in such different organizational models are: telephone and traffic routing and control [13], game theory [30], stochastic geometric problems [15], pattern recognition [30], and the stochastic point location problem [20].
Learning automata can be classified with respect to their transition states as being either deterministic or stochastic. For a deterministic automaton, given an initial state and input, the next state and action are uniquely specified. For a stochastic automaton, given an initial state and input sequence, there is no certainty regarding the next states and actions of the automaton. These automata can further be classified with respect to their transition functions in two categories: Fixed Structure Stochastic Automata (FSSA) and Variable Structure Stochastic Automata (VSSA). In the case of FSSA, the state transition and output functions are independent of time, and are thus considered to be of a "fixed structure". The VSSA are designed with more flexibility, allowing the state transition and output functions to vary in time.
The earlier models of stochastic learning automata consider the probability space as a continuous space, the action probabilities being able to take any value in the interval [0, 1]. In 1979, Thathachar and Oommen [26] opened a new direction in the evolution of the field of learning automata by introducing the concept of discretized learning automata. These automata operate in the probability space [0, 1], which is divided into a finite number of continuous subintervals, and the probabilities of choosing different actions are allowed to assume values from this finite set. The learning automata designed using this technique exhibit a decrease in their convergence time. They are therefore faster than the learning automata designed to use a continuous probability space.
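To make the discretization idea concrete, the following is a small illustrative sketch (ours, not pseudocode from the thesis) of a two-action discretized reward-inaction update, in which the probability of the first action moves on the finite grid {0, Δ, 2Δ, ..., 1} in fixed steps Δ:

```python
# Illustrative sketch of a two-action discretized reward-inaction update.
# The step size delta = 1/N keeps p1 on the finite grid {0, 1/N, ..., 1};
# the names and the two-action restriction are ours, for illustration only.

def discretized_ri_step(p1: float, chosen: int, rewarded: bool,
                        delta: float) -> float:
    """Return the updated probability of action 1 (p2 = 1 - p1)."""
    if not rewarded:
        return p1                        # inaction: ignore penalties
    if chosen == 1:
        return min(1.0, p1 + delta)      # step toward the rewarded action 1
    return max(0.0, p1 - delta)          # step toward the rewarded action 2
```

Because every update adds or subtracts a fixed Δ, the probability can only assume finitely many values, which is precisely what distinguishes the analysis of discretized schemes from their continuous counterparts.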
In the attempt to model the psychological concepts of learning, various VSSA used different learning paradigms. Specifically, the basic concepts of the operant conditioning learning method have been applied in the context of the VSSA algorithms. These algorithms would update the action probabilities in the following situations: a) when the environment rewarded or penalized an action, b) when the environment rewarded an action, ignoring the penalties, c) when the environment penalized an action, ignoring the rewards.
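As a hedged illustration of paradigm (b) — update on reward, ignore penalties — the classical linear reward-inaction (LRI) rule can be sketched as follows (the function and parameter names are ours):

```python
# Sketch of a continuous linear reward-inaction (LRI) update for r actions.
# lam in (0, 1) is the learning rate; on a penalty nothing changes.

def lri_update(p: list, chosen: int, rewarded: bool, lam: float) -> list:
    """p: action probability vector; chosen: index of the chosen action."""
    if not rewarded:
        return p[:]                          # inaction on penalty
    q = [(1.0 - lam) * pi for pi in p]       # shrink every probability
    q[chosen] += lam                         # move the freed mass to the
    return q                                 # rewarded action
```

Note that the update preserves the total probability mass: the vector is shrunk by the factor (1 - λ) and the amount λ is given entirely to the rewarded action.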
In the quest to design even faster converging learning algorithms, Thathachar and Sastry [27] introduced a new class of learning automata, called estimator algorithms. These algorithms are characterized by the fact that they maintain running estimates for the penalty probability of each possible action, and use them to update the probabilities of choosing each action. Thathachar and Sastry introduced the concepts of the estimator algorithms by first presenting the Pursuit Estimator Algorithm [27]. This learning algorithm 'pursues' the action that is considered to have the highest reward estimate. Later, in 1984, the same authors presented another estimator algorithm, the TSE algorithm, which increases the action probabilities for all the actions that have higher reward estimates than the currently chosen action. Oommen and Lanctôt continued the study of the estimator algorithms and presented in 1990 discretized versions of the Pursuit and TSE algorithms.
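One iteration of the continuous Pursuit scheme just described can be sketched as follows (our own simplification: the maintenance of the reward estimates themselves is omitted, and all names are illustrative):

```python
# Sketch of one iteration of a continuous Pursuit scheme: the probability
# vector is moved a fraction lam toward the unit vector of the action with
# the currently highest reward estimate d_hat.

def pursuit_step(p: list, d_hat: list, lam: float) -> list:
    best = max(range(len(p)), key=lambda i: d_hat[i])
    return [(1.0 - lam) * pi + (lam if i == best else 0.0)
            for i, pi in enumerate(p)]
```

The 'pursuit' is visible in the update: only the currently best-estimated action gains probability, which is exactly the behavior the generalized schemes of this thesis set out to relax.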
Learning Automata can be used to solve other, more general, learning problems or can be employed as basic learning elements of other learning machines. For instance, the research on LA directly influenced the trial-and-error thread of reinforcement learning, leading to modern reinforcement learning research [24]. Due to their learning philosophy, LA can be employed in solving simplified reinforcement learning problems. Specifically, Sutton and Barto in [24] showed how the LRI, LRP, and the Pursuit schemes can be used in solving evaluative feedback problems such as the n-armed bandit problem. Because the LA learn to choose the optimal action from a set of allowable actions, the structure proves to be impractical for solving the full reinforcement learning problem, whose goal is to maximize the total amount of reward received over the long run. A method that solves the full reinforcement problem is the Q-learning method, whose objective is to find a control rule that maximizes at each time step the expected discounted sum of future rewards. From this perspective, the LA differ from the Q-learning method because they attempt only to learn the optimal action, independently of the expected sum of rewards.
Narendra and Thathachar [11] presented various models of interconnected learning automata, such as synchronous and sequential models, hierarchies and networks of learning automata. The last model of interconnected automata was strongly influenced by the multilayered artificial neural networks or connectionist networks, but with some underlying differences. Specifically, the networks of learning automata differ from the neural network model in the way the networks are adjusted. The neural networks are adjusted based on the error between the output of the network and some desired output, whereas the adjustment in the network of automata depends on the random response from the environment. In both cases, their performance functions are improved by the adjustment of the weight vectors. Further research is necessary in the field of Learning Automata to determine the weights of a learning network using the learning automata schemes.
1.2 Justification and objectives of the Thesis
This thesis concentrates on the study of the estimator algorithms, focusing primarily on the introduction of new and better-performing estimator algorithms. It also studies the characterization of their performance in comparison to the existing estimator algorithms.
The original Pursuit algorithm presented by Thathachar and Sastry is a continuous algorithm that updates the action probabilities whenever the environment rewards or penalizes an action. Oommen and Lanctôt [18] later extended the Pursuit algorithm into the discretized world by presenting the Discretized Pursuit Algorithm based on the reward-inaction learning paradigm. The Reward-Penalty and Reward-Inaction learning paradigms in conjunction with the continuous and discrete models of computation lead to four versions of Pursuit Learning Automata, but only two of them have been presented in the literature. This represents a gap in the class of the learning automata, which we address. Hence, one of the objectives of this study is to introduce the new versions of Pursuit algorithms that completely cover all possible versions of these algorithms, and to present a thorough comparison of their performance based on simulation results.
The Pursuit algorithm 'pursues' the action that is currently estimated as the best action. This implies that if, at a certain time in the evolution of the automaton, the action estimated to be the best is not the action with minimum penalty probability, the automaton pursues a wrong action. This behavior can considerably increase the convergence time in such environments, and we consider this a limitation in the design of Pursuit algorithms. Therefore, another objective of this thesis is to improve the design of the Pursuit algorithm in order to minimize the probability of pursuing a wrong action, and to increase its convergence performance.
Oommen and Lanctôt [18] presented a continuous version of the Pursuit algorithm in a vectorial form, allowing for a better understanding of the concepts of the Pursuit algorithm. To better outline the underlying concepts of the TSE algorithm, a goal of our study is to generate a vectorial form of the TSE algorithm, and to propose new directions of improvement of the performance of the TSE algorithm.
In summary, the main objectives of this thesis are:
- to create new Pursuit algorithms that utilize in their design the reward-penalty and reward-inaction learning paradigms, and to study their performance;
- to create new Pursuit algorithms that minimize the probability of choosing a wrong action, therefore exhibiting a better convergence performance;
- to generate a vectorial representation of the updating equations of the TSE algorithm and to determine new directions of improving the performance of the TSE algorithm.
1.3 Contributions of the Thesis
In this thesis, we introduce five new estimator algorithms: the Continuous Pursuit Reward-Inaction Algorithm (CPRI), the Discrete Pursuit Reward-Penalty Algorithm (DPRP), the Generalized Pursuit Algorithm (GPA), the Discretized Generalized Pursuit Algorithm (DGPA), and the Generalized TSE Algorithm (GTSE).
The Continuous Pursuit Reward-Inaction (CPRI) and the Discrete Pursuit Reward-Penalty (DPRP) algorithms, along with the existing Continuous Pursuit Reward-Penalty (CPRP) and Discrete Pursuit Reward-Inaction (DPRI) algorithms, completely cover all the possible versions of Pursuit algorithms that use the Reward-Penalty and Reward-Inaction learning paradigms. The experimental results obtained during this study prove that among these algorithms, the DPRI is the fastest one. Furthermore, we have found that the Reward-Inaction schemes are generally superior to their Reward-Penalty counterparts, and we have experimentally verified that the discretized schemes exhibit a better performance than the continuous schemes.
The search for a new direction of improving the performance of the Pursuit algorithm led to the development of the generalized Pursuit algorithms GPA and DGPA. These algorithms generalize the concepts of the original Pursuit algorithm by pursuing more than one action, therefore minimizing the probability of pursuing a wrong action and increasing the performance of these algorithms. The Generalized Pursuit Algorithm (GPA) proves to be the fastest continuous Pursuit algorithm, and the Discretized Generalized Pursuit Algorithm (DGPA) proves to be the fastest discretized Pursuit algorithm and, more generally, the fastest Pursuit algorithm reported to date.
Another contribution of this thesis is the development of an original vectorial representation of the updating equations of the TSE algorithm. This vectorial representation opens a new direction of generalizing the TSE algorithm, and leads to the introduction of the novel GTSE algorithm. This new algorithm experimentally proves to be the fastest continuous estimator algorithm reported to date.

All the novel algorithms proposed in this thesis exhibit very good convergence properties. Also, all of these algorithms have been proven ε-optimal in any stationary random environment.
1.4 Content and organization of the Thesis
This thesis debuts with an overview of the field of learning automata in Chapter 2, presenting first the underlying mathematical models, and also describing the measures of the performance of these algorithms, including expediency, optimality, ε-optimality and absolute expediency. Also, in this chapter we present the concepts of the Fixed Structure and Variable Structure Stochastic Automata, along with various linear schemes that use different learning updating paradigms such as the reward-penalty, the reward-inaction and the inaction-penalty. Moreover, the continuous and discretized models of learning automata are portrayed through examples of various such learning schemes.
Chapter 3 addresses the first objective of this thesis, and contains a study of the Pursuit algorithms with respect to different learning paradigms such as reward-penalty and reward-inaction. The novel work presented in this chapter was published in the technical paper [16].

Chapter 4 presents two new generalized Pursuit algorithms, along with a relative comparison of their performance. The work presented in this chapter addresses the second objective of this thesis, which is to improve the performance of the Pursuit estimator algorithms by minimizing the probability of pursuing a wrong action.
The vectorial representation of the updating equations of the TSE algorithm is presented in Chapter 5, together with a new generalized TSE algorithm and an evaluation of its performance.

Chapter 6 concludes the thesis by presenting a summary of the results, and outlining the main conclusions of this work.
Chapter 2: LEARNING AUTOMATA - AN OVERVIEW
2.1 Definition of Automaton
An automaton is defined, in general, as a system that directs itself and does not need exterior guidance in order to function. Mathematically, an automaton is defined by a set of input actions B, a set of states Q, a set of output actions A, and the functions F and G needed to compute the next state of the automaton and its output, respectively. The input determines the evolution of the automaton from its current state to the next state, as in Figure 2.1. If the output depends only on the current state, the automaton is defined as a state-output automaton.
Figure 2.1: The automaton. [Diagram: the input β(t) and the set of states Q determine the next state q(t+1) = F(β(t), q(t)).]
The following is a formal definition of an automaton:

Definition 2.1: An automaton is a quintuple <A, B, Q, F, G>, where:
- A = {α1, α2, ..., αr} is the set of output actions of the automaton, 2 ≤ r < ∞.
- B is the set of input actions, which can be finite or infinite.
- Q is the vector state of the automaton, with q(t) denoting the state at instant t: Q = (q1(t), q2(t), ..., qs(t)).
- F: Q×B→Q is the transition function that determines the state at the instant t+1 in terms of the state and input at the instant t: q(t+1) = F(q(t), β(t)). This mapping can be either deterministic or stochastic.
- The output function G determines the output of the automaton at any instant t based on the state at the current instant: α(t) = G(q(t)). The mapping G: Q→A can, with no loss of generality, be considered deterministic [11].

The automaton is considered to be finite if the sets Q, B and A are all finite.
The automaton is considered deterministic or stochastic as described below.

2.1.1 Deterministic Automaton
The automaton is a deterministic automaton if both F and G are deterministic mappings. For such an automaton, given an initial state and input, the next state and action are uniquely specified.
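A deterministic state-output automaton in this sense can be sketched as two lookup tables; the concrete states, inputs and outputs below are invented purely for illustration:

```python
# Toy deterministic automaton: F maps (state, input) to the next state,
# and G maps each state to an output action (a state-output automaton).

F = {("q1", 0): "q1", ("q1", 1): "q2",
     ("q2", 0): "q1", ("q2", 1): "q2"}
G = {"q1": "a1", "q2": "a2"}

def run(state: str, inputs: list) -> list:
    """Feed an input sequence and collect the resulting output actions."""
    outputs = []
    for b in inputs:
        state = F[(state, b)]        # q(t+1) = F(q(t), beta(t))
        outputs.append(G[state])     # output of the new state
    return outputs
```

Given an initial state and an input sequence, the trajectory of states and outputs is uniquely determined, which is exactly the deterministic property described above.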
2.1.2 Stochastic Automaton
If F or G is stochastic, the automaton is called a stochastic automaton. If the state transition mapping F is stochastic, then, given an initial state and input sequence, there is no certainty regarding the states and actions that follow. We can only consider the probabilities of reaching various states. For this reason, F can be specified in terms of the conditional probability matrices F^β1, F^β2, ..., F^βm, where each F^β for β ∈ B is an s×s matrix whose entries are given by:

    f^β_ij = Pr(q(t+1) = q_j | q(t) = q_i, β(t) = β), i, j = 1, ..., s.    (2.1)

Hence, f^β_ij represents the probability that the automaton moves from state q_i to q_j on receiving the input β. Based on the fact that all the f^β_ij are probabilities, and that after a transition the automaton has to reach one of its states, we have:

    Σ_{j=1}^{s} f^β_ij = 1, for each i = 1, ..., s,    (2.2)

and, consequently, F^β is a Markov matrix.
If the mapping G is stochastic, it can be represented by a conditional probability matrix of dimension s×r having the following elements:

    g_ij = Pr(α(t) = α_j | q(t) = q_i), i = 1, ..., s, j = 1, ..., r.    (2.3)

In the above equation, g_ij denotes the probability that the automaton chooses α_j given that it is in state q_i. Since the g_ij are probabilities, it follows that:

    Σ_{j=1}^{r} g_ij = 1, for each i = 1, ..., s.    (2.4)

It can be shown [22] that by a proper redefinition of states, the output function G of any stochastic automaton can be made deterministic by increasing the number of states in the automaton.
If the conditional probabilities f^β_ij and g_ij are independent of both t and the input sequence, the stochastic automaton is called a fixed structure stochastic automaton (FSSA). If the transition probabilities f^β_ij vary based on the input at each step t, the automaton is called a variable structure stochastic automaton (VSSA).
If the transition mapping is stochastic, we cannot determine precisely the state of
the automaton at a given time. Indeed, we can only calculate the probability with which
the automaton is in a particular state at a given instant. These probabilities are known as
state probabilities. The state probability vector can be defined as
Π(t) = [π_1(t), π_2(t), ..., π_s(t)]^T, where T denotes the transpose and Σ_{i=1}^{s} π_i(t) = 1.
Given the input and the initial state probability vector Π(0), the state probability vector at
t=1 is obtained as follows:

π_j(1) = Σ_{i=1}^{s} f_ij^{β(0)} π_i(0), j = 1, ..., s.
Chapter 2: LEARNING AUTOMATA - AN OVERVIEW
In vector form, the equation can be written as:

Π(1) = [F^{β(0)}]^T Π(0).

Recursively, this leads to the state probability vector at time t, as:

Π(t) = [F^{β(t−1)}]^T [F^{β(t−2)}]^T ... [F^{β(0)}]^T Π(0).
Similarly, the components of the action probability vector P(t) are defined as:

p_i(t) = Pr(α(t) = α_i), i = 1, ..., r,

which can be seen to be related to Π(t) by [11]:

P(t) = G^T Π(t).

This equation shows that one can relate the state probability Π(t) to the action
probability P(t), and thus perform the above computations by first processing Π(t) and
then computing P(t) using the output matrix.
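As a concrete illustration of these computations, the sketch below propagates the state probability vector through the matrices F^β and then obtains the action probabilities from the output matrix. The 3-state, 2-action automaton and all matrix entries are hypothetical, chosen only for the example.

```python
# Sketch: propagating state probabilities through the conditional
# transition matrices F^beta, then mapping them to action probabilities
# via the stochastic output matrix G.  All numbers are illustrative.

def evolve(pi, F_beta):
    """One step of pi_j(t+1) = sum_i f_ij^beta * pi_i(t)."""
    s = len(pi)
    return [sum(F_beta[i][j] * pi[i] for i in range(s)) for j in range(s)]

def action_probs(pi, G):
    """p_j(t) = sum_i g_ij * pi_i(t)."""
    s, r = len(G), len(G[0])
    return [sum(G[i][j] * pi[i] for i in range(s)) for j in range(r)]

# Hypothetical 3-state automaton with two environment inputs (beta = 0, 1).
F = {0: [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.3, 0.7]],
     1: [[0.5, 0.5, 0.0], [0.1, 0.4, 0.5], [0.2, 0.2, 0.6]]}
G = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # state -> action output matrix

pi = [1.0, 0.0, 0.0]                       # start in state q_1
for beta in [0, 0, 1]:                     # an example input sequence
    pi = evolve(pi, F[beta])
print(action_probs(pi, G))                 # action probability vector P(3)
```

Each step is a single matrix-vector product, so the recursion above is exactly the chained product Π(t) = [F^{β(t−1)}]^T ... [F^{β(0)}]^T Π(0).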
2.2 The Learning Automaton
The goal of an automaton is to determine the optimal action from a set of possible
actions. The automaton performs these actions in a random environment that generates a
response for each action. The following section of this chapter describes the
characteristics of the environment in which a learning automaton operates.
2.2.1 The Environment
The environment (see Figure 2.2) can be defined mathematically by a triple
{A, C, B}, where A={α_1, α_2, ..., α_r} represents a finite input set, B={β_1, β_2, ..., β_l}
(2 ≤ l < ∞) is the output set of the environment, and C={c_1, c_2, ..., c_r} is a set of penalty
probabilities, where each element c_i of C corresponds to an input action α_i.
Figure 2.2: The environment
At any discrete time t (t=0,1,2,...) an input α(t) can be applied to the environment,
which will generate an output β(t). Usually, the output set of an environment has two
elements, β_1 and β_2, which are considered to be 0 and 1 for mathematical convenience.
As a convention, an output β(t)=1 is considered a failure or an unfavorable response or a
penalty, and an output β(t)=0 is considered a success or a favorable response or a reward.
The systems that interact with an environment that generates only two output values are
considered P-models. If the output of the environment is a finite set, the systems
interacting with this type of environment are referred to as Q-models. As a further
generalization, when the output of the environment is a continuous random variable,
which assumes values in the interval [0, 1], the model is referred to as an S-model.
The environments can also be classified based on their evolutionary properties. If
the penalty probabilities c_i (i=1,2,...,r) are constant in time, the environment is called a
stationary environment. However, if one or more penalty probabilities c_i (i=1,...,r) are
not constant, the environment is considered nonstationary.
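A stationary P-model environment as defined above admits a direct sketch; the class name and the penalty values below are ours, purely illustrative.

```python
import random

class StationaryEnvironment:
    """Stationary P-model environment: for an input action alpha_i it
    returns a penalty (beta = 1) with fixed probability c_i, and a
    reward (beta = 0) otherwise.  The c_i never change over time."""

    def __init__(self, penalty_probs, seed=None):
        self.c = list(penalty_probs)       # penalty probabilities c_i
        self.rng = random.Random(seed)

    def respond(self, action):
        """Return 1 (penalty/unfavorable) or 0 (reward/favorable)."""
        return 1 if self.rng.random() < self.c[action] else 0

env = StationaryEnvironment([0.2, 0.6], seed=42)
print(env.respond(0))                      # 0 or 1
```

A nonstationary environment would differ only in that `self.c` is allowed to change between calls.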
2.2.2 Definition of the Learning Automaton
A "learning" automaton is an automaton that interacts with a random
environment, having as its goal to improve its behavior. It is connected to the environment
in a feedback loop, such that the input of the automaton is the output of the environment
and the output of the automaton is the input of the environment, as shown in Figure 2.3.
Figure 2.3: Feedback connection of automaton and environment
Starting from an initial state q(0), the learning automaton generates the corresponding
action α(0). The environment generates a response to this action, β(0), which dictates to
the automaton that, based on its transition matrix F, it should change its state to q(1). This
cycle is repeated until the probability of choosing the action that has the smallest penalty
probability, hopefully, becomes as close to unity as desired.
In order to define some quantitative norms of behavior for the learning automata,
the automata are considered to operate in a stationary random environment with the
penalty probabilities {c_1, c_2, ..., c_r}. If two automata operate in such an environment, the
automaton that receives a larger number of favorable responses from the environment is
considered better. To achieve this, the automaton has to learn to choose the "best" action,
where the "best" action is considered the action with minimum penalty probability c_min.
One basic method by which one could learn to choose the best action is based on
the pure chance approach. If there is no a priori information regarding each action, it is
not possible to distinguish between the different actions. In such a case, each action is
chosen with equal probability p_i(t) = 1/r, i=1,2,...,r. An automaton that uses this learning
approach is called a "pure-chance automaton" and it is considered a standard for
comparison of the behavior of the learning automata.
In order to compare various learning automata, the average penalty for a given
action probability vector P(t) at time t is defined as:

M(t) = E[β(t) | P(t)] = Σ_{i=1}^{r} c_i p_i(t).

For the "pure-chance" automaton, the average penalty is calculated to be [11]:

M_0 = (1/r) Σ_{i=1}^{r} c_i.

An automaton is considered better than the pure-chance automaton if its average penalty
M(t) is smaller than M_0, at least asymptotically, as t→∞. As M(t) and lim_{t→∞} M(t) are
random variables, one can compare E[M(t)] with M_0.
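As a small numeric illustration of the two quantities (the penalty probabilities below are made up):

```python
def average_penalty(c, p):
    """M(t) = sum_i c_i * p_i(t): the expected penalty under P(t)."""
    return sum(ci * pi for ci, pi in zip(c, p))

c = [0.2, 0.6, 0.7]                        # illustrative penalty probabilities
m0 = average_penalty(c, [1/3, 1/3, 1/3])   # pure-chance automaton: M_0 = 0.5
m = average_penalty(c, [0.8, 0.1, 0.1])    # a vector favouring the best action
print(m0, m)                               # an automaton keeping m < m0 is better
```

Shifting probability mass onto the action with the smallest c_i is precisely what drives M(t) below M_0.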
Based on the comparison with the pure-chance automaton, any automaton that performs
better than the pure-chance automaton is considered expedient. Mathematically, this
definition can be expressed as follows:

Definition 2.2: A learning automaton is considered expedient if

lim_{t→∞} E[M(t)] < M_0.
More strictly, the behavior of an automaton can be characterized by the following
definition:

Definition 2.3: A learning automaton is said to be absolutely expedient if

E[M(t+1) | P(t)] < M(t), (2.15)

where M(t) is the expected penalty probability at instant t, and P(t) is the probability
vector.
The condition of absolute expediency imposes an inequality on the expected
penalty probability M(t) at each instant. Taking expectations again in equation (2.15),
we obtain [11]:

E[M(t+1)] < E[M(t)],

which shows that E[M(t)] is strictly decreasing with t in all stationary random
environments.
As stated earlier, the goal of any automaton is to learn to asymptotically choose
the best action. An automaton that achieves this goal is considered optimal.
Mathematically, the definition of optimality in the context of learning automata is given
as follows:

Definition 2.4: A learning automaton is considered optimal if

lim_{t→∞} p_m(t) = 1 with probability 1, (2.17)

where p_m(t) is the action probability associated with the minimum penalty probability
c_min.
Unfortunately, at present, there are no optimal learning automata. In this case,
one might aim at a sub-optimal performance, termed ε-optimality [33].
Definition 2.5: Let λ be a learning parameter. A learning automaton is said to be ε-
optimal if for every ε > 0 and δ > 0, there exist a time t_0 and a λ_0 > 0 such that

Pr[ p_m(t) > 1 − ε ] > 1 − δ

for all t ≥ t_0 and λ ≤ λ_0.
The ε-optimality definition implies that given enough time and given an internal
parameter λ (usually depending on the number of internal states), the probability of
choosing the best action almost all the time can be made as close to unity as desired.
It has been shown that if an automaton is absolutely expedient then it is also
ε-optimal in all stationary random environments [5].
2.3 Fixed Structure and Variable Structure Learning Automata
A variety of learning automata have been proposed, beginning with Tsetlin's
pioneering paper in 1961 [31]. Initial learning automata designs had time-invariant
transition and output functions, being considered "fixed structure" learning automata.
Tsetlin, Krylov, and Krinsky [31] [32] presented notable examples of this automata
type. Variable Structure Stochastic Automata (VSSA) were developed later, in which the
state transition functions and the output functions were time dependent [11].
2.3.1 Fixed Structure Automata

2.3.1.1 Tsetlin Automaton
This was the first learning automaton presented in the literature [31]. It is a
deterministic fixed structure automaton, denoted L_{2N,2}, with 2N states and 2 actions, i.e. N
states for each action. Furthermore, its structure can easily be extended to deal with
r (2 < r < ∞) actions. As with any learning automaton, its goal is to incorporate knowledge from
the past behavior of the system in its decision rule for choosing the next sequence of
actions. To achieve this, the automaton counts the number of successes and failures
received for each action, and switches to the alternate action only when it receives a
sufficient number of failures, depending on its current state. In order to describe the
behavior of this type of automaton, its output and state transition functions will be
described below.
The output function of the automaton is simple: if the automaton is in a state
q_i (1 ≤ i ≤ N), it chooses action α_1, and if it is in a state q_i (N+1 ≤ i ≤ 2N) it chooses action
α_2. Since each action has N states associated with it, N is called the memory associated
with each action, and the automaton is said to have a total memory of 2N.
The state transitions are illustrated by the two graphs presented in Figure 2.4, one
for a favorable response and one for an unfavorable response. If the environment replies
with a reward (favorable response), the automaton moves deeper into the memory of the
corresponding action. If the environment replies with a penalty (unfavorable response),
the automaton moves towards the outside boundary of the memory of the corresponding
action. The deepest states in memory are referred to as the most internal states, or the
end states.
Figure 2.4: State transition graphs for the Tsetlin automaton L_{2N,2} (favorable response β=0; unfavorable response β=1)
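The transition rules just described can be sketched as a small two-action implementation; the class name and the 1..2N state encoding are ours, mirroring the text.

```python
class TsetlinAutomaton:
    """L_{2N,2} sketch: states 1..N select action 1, states N+1..2N select
    action 2.  A reward moves the automaton one state deeper into the
    current action's memory (towards q_1 or q_{N+1}); a penalty moves it
    one state outward, crossing over to the other action's boundary
    state when it is already at its own boundary (q_N or q_{2N})."""

    def __init__(self, memory_depth, start_state=1):
        self.N = memory_depth
        self.state = start_state            # in 1..2N

    def action(self):
        return 1 if self.state <= self.N else 2

    def update(self, beta):
        N, q = self.N, self.state
        if beta == 0:                       # reward: move deeper
            if q not in (1, N + 1):         # deepest states stay put
                self.state = q - 1
        elif q == N:                        # penalty at boundary of action 1
            self.state = 2 * N              # switch to action 2
        elif q == 2 * N:                    # penalty at boundary of action 2
            self.state = N                  # switch to action 1
        else:                               # penalty: one state outward
            self.state = q + 1
```

With this encoding, N consecutive penalties from state 1 followed by N consecutive rewards indeed move the automaton from state 1 to state N+1, as the text's example requires.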
The Tsetlin automaton can be analyzed using the theory of Markov chains [11].
From this perspective, it is possible to characterize the states of the automaton as
recurrent states, meaning that the automaton can be in any state an infinite number of
times. For example, if the automaton is in state 1, and action 1 is penalized N
consecutive times, and after that the automaton is rewarded N consecutive times, the
automaton will move from state 1 to state N+1. Similarly, the automaton can move from
the state N+1 to the state 1. Furthermore, the automaton is irreducible because every two
states (q_i, q_j) communicate [11]. Another characteristic of this automaton is that it is
aperiodic, since it can loop in the states 1 and N+1 an arbitrary number of times.
Any finite Markov chain that is irreducible and aperiodic is ergodic [6], and this
implies that the Tsetlin automaton is ergodic. This property of the Tsetlin automaton
indicates that the automaton will converge to a state probability distribution
independently of the probability distribution at the starting state.
The expected asymptotic penalty probability for the Tsetlin automaton was derived
in [11], where c_i = 1 − d_i is the penalty probability of action α_i, i=1,2. It has been shown that the
Tsetlin automaton is ε-optimal in all environments for which min{c_1, c_2} ≤ 0.5 [11].
In 1964, Krinsky made an important step in learning automata theory by
presenting another deterministic automaton, which was ε-optimal in all environments.
The next section presents this automaton in detail.

2.3.1.2 Krinsky Automaton
This automaton, denoted K_{2N,2}, is a deterministic automaton and, like its
predecessor, the Tsetlin automaton, is an automaton with 2N states and 2 actions. The
output function is identical to the output function of the Tsetlin automaton, i.e. if the
automaton is in any state q_i (i=1,2,...,N) it chooses action α_1, and if it is in any state q_i
(i=N+1,N+2,...,2N), it chooses action α_2.
The state transition function of the Krinsky automaton is similar but not identical to
the state transition function of the Tsetlin automaton. When the environment replies with
a penalty, the automata exhibit the same behavior, i.e. they move towards the outside
margins of their current action's domain. The difference lies in the way the automata
react when the environment rewards an action. Tsetlin had his automaton move exactly
one state closer to or further from its internal states for each reward or penalty. The
"philosophy" behind Krinsky's automaton is to give a maximum effect to each reward.
This implies moving to the deepest state in the memory when the environment rewards an
action, so that N consecutive penalties are required to change the action of the automaton.
In the case of a favorable response, if the automaton is in any state q_i (i=1,2,...,N), it
passes to state q_1, and if it is in any state q_i (i=N+1,N+2,...,2N), it passes to the state q_{N+1},
as shown in Figure 2.5.
Figure 2.5: State transition graphs for the Krinsky automaton (favorable response β=0; unfavorable response β=1)
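Only the reward branch distinguishes the Krinsky automaton from the Tsetlin automaton; a sketch of a single transition, with the same 1..2N state numbering used in the text (the function name is ours):

```python
def krinsky_step(state, beta, N):
    """One transition of the (two-action) Krinsky automaton: a reward
    jumps straight to the deepest state of the current action (q_1 or
    q_{N+1}); a penalty moves one state outward, switching actions at
    the boundary states q_N and q_{2N}, exactly as in Tsetlin's design."""
    if beta == 0:                  # reward: maximum effect
        return 1 if state <= N else N + 1
    if state == N:                 # penalty at the boundary of action 1
        return 2 * N
    if state == 2 * N:             # penalty at the boundary of action 2
        return N
    return state + 1               # penalty: one state outward
```

After any reward, the automaton sits in an end state, so N consecutive penalties are needed before the action can change.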
The expected asymptotic penalty probability for the Krinsky automaton was
calculated in [11], where c_i, i=1,2, is the penalty probability of action α_i. It can be easily
shown that the Krinsky automaton is ε-optimal in all stationary random environments [11].
The Tsetlin and Krinsky automata are considered deterministic automata, since
both their output and state transition functions are deterministic. In order to present the
whole class of fixed structure automata, the next section presents a learning automaton
that is stochastic, introduced by Krylov in 1964.
2.3.1.3 Krylov Automaton
The Krylov automaton (K'_{2N,2}) is also an automaton with 2N states and 2 actions, and
has the same output function as the L_{2N,2} automaton. Furthermore, Krylov's
automaton has the same state transition function as the L_{2N,2} automaton, but only when the
response of the environment is favorable. The behavior differs when the automaton is
penalized; in this situation, the behavior of the Krylov automaton is stochastic rather than
deterministic. If the automaton is penalized, it moves towards or away from its internal
states, each with probability 0.5, as shown in Figure 2.6.
Figure 2.6: State transition graphs for the Krylov automaton K'_{2N,2} (favorable response β=0; unfavorable response β=1)
It is important to note that the modification made by Krylov makes the automaton
ε-optimal in all environments. The expected asymptotic penalty probability is expressed
in [11] in terms of the ratios λ_i = c_i / (1 − c_i), i = 1, 2; as N increases, the limit
approaches min{c_1, c_2}, which proves that the automaton is ε-optimal [11].
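The stochastic penalty branch can be sketched as follows; this is a minimal two-action sketch, and the function name, together with the choice of letting a "deeper" move from a deepest state stay put, are our assumptions.

```python
import random

def krylov_step(state, beta, N, rng=random):
    """One transition of the (two-action) Krylov automaton: rewards
    move one state deeper, exactly as in the Tsetlin automaton; on a
    penalty the automaton moves deeper or outward with probability 0.5
    each, outward moves switching actions at states q_N and q_{2N}."""
    deeper = state if state in (1, N + 1) else state - 1
    if beta == 0:                  # reward: deterministic, one state deeper
        return deeper
    if rng.random() < 0.5:         # penalty, fair coin: move deeper...
        return deeper
    if state == N:                 # ...or outward, crossing the boundary
        return 2 * N
    if state == 2 * N:
        return N
    return state + 1
```

Passing an object with a deterministic `random()` method as `rng` makes the penalty branch testable, as in the assertions below.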
Krylov's automaton is a fixed structure stochastic automaton (FSSA). The
concepts of the automata presented above can be extended to cases where the automata
can perform r (2 ≤ r < ∞) actions {α_1, α_2, ..., α_r}. The automata with many actions and the
automata with only two actions differ mainly in those states where the automaton
switches from one action to the next. A detailed description of this generalization can be
found in [32] [11].
2.3.2 Variable Structure Stochastic Automata
In the search for greater flexibility in designing automata, Varshavskii and
Vorontsova [34] were the first to propose a class of automata that update their transition
probabilities; these are called Variable Structure Stochastic Automata (VSSA). The
principal characteristic of this type of automata is that the state transition probabilities or
the action selecting probabilities are updated with time.
For mathematical simplicity, it is assumed that each state corresponds to a distinct
action. This implies that the number of states s is equal to the number of actions
r (s = r < ∞), and so the action transition mapping G becomes the identity mapping.
Varshavskii and Vorontsova have proved that every VSSA is completely defined by a set
of action probability updating rules, and so the state transition mapping F becomes
equivalent to the probability updating rule for the definition of a VSSA. The learning
automaton operates on a probability vector P(t)=[p_1(t),...,p_r(t)]^T, where p_i(t) (i=1,...,r) is
the probability that the automaton will select the action α_i at the time t: p_i(t)=Pr[α(t)=α_i].
A mathematical description of a variable structure stochastic automaton is given below:

Definition 2.6: A variable structure stochastic automaton (VSSA) is a 4-tuple
<A,B,T,P>, where A is the set of actions, B is the set of inputs of the automaton (the set
of outputs of the environment), and T: [0,1]^r × A × B → [0,1]^r is an updating scheme such that

P(t+1) = T(P(t), α(t), β(t)),

where P is the action probability vector, P(t)=[p_1(t),p_2(t),...,p_r(t)]^T, with
p_i(t)=Pr[α(t)=α_i], i=1,...,r, and Σ_{i=1}^{r} p_i(t) = 1 for all t.
As in the case of FSSA, the VSSA can be analyzed using Markov chain
theory. If the mapping T is independent of time, the probability P(t+1) is determined
completely by P(t), which implies that {P(t)}_{t≥0} is a discrete-time homogeneous Markov
process. From this perspective, different mappings T identify different types of
learning algorithms. If the mapping T is chosen in such a manner that the Markov
process has absorbing states, the algorithm is referred to as an absorbing algorithm.
Similarly, non-absorbing algorithms are Markov processes with no absorbing states.
Ergodic VSSA are suitable for non-stationary environments because their behavior is
independent of their initial states. Thathachar and Narendra have presented different
varieties of absorbing algorithms in [11]. Ergodic VSSA have been proposed in [11],
[12], [7]. The goal of a VSSA is to choose a mapping T such that the learning algorithm
satisfies one of the performance criteria.
The VSSA can be classified according to the functional form of the updating
probability. If P(t+1) is a linear function of P(t), the automaton is said to be linear; otherwise it
is considered nonlinear. Occasionally, two or more automata are combined to form a
hybrid automaton. Independent of the form of the updating scheme, a VSSA follows
some basic learning principles. If an action has been rewarded, the automaton
increases the probability of this action, decreasing the probability of all other actions. If
an action has been penalized, the automaton decreases the probability of this action,
increasing the probability of all other actions. Depending on the learning principle of its
VSSA, different combinations of updating schemes can be enumerated as:
RP (Reward-Penalty) - the probabilities are updated both when the automaton is
rewarded and when it is penalized.
RI (Reward-Inaction) - the probabilities are updated when the automaton is
rewarded and are left unchanged when the automaton is penalized.
IP (Inaction-Penalty) - the probabilities are updated when the automaton is
penalized and are left unchanged when the automaton is rewarded.
If the mapping T is a continuous one, the automaton is considered a continuous
automaton. The VSSA presented originally were continuous algorithms. In 1979,
Thathachar and Oommen introduced discretized versions of learning VSSA [26], which
have later been extended to yield varieties of absorbing, ergodic and estimator types of
learning automata [14] [17] [19] [18].
A general updating scheme for a continuous VSSA operating in a stationary
environment with B={0,1} can be represented as follows. If the action chosen is α(t)=α_i,
the updated probabilities for the non-chosen actions are:

p_j(t+1) = p_j(t) − g_j(P(t)), for all j ≠ i, when β(t) = 0,
p_j(t+1) = p_j(t) + h_j(P(t)), for all j ≠ i, when β(t) = 1.

Because P(t) is a probability vector, it has to satisfy Σ_{j=1}^{r} p_j(t) = 1, which implies that

p_i(t+1) = p_i(t) + Σ_{j≠i} g_j(P(t)), when β(t) = 0,
p_i(t+1) = p_i(t) − Σ_{j≠i} h_j(P(t)), when β(t) = 1.

In the above representation, the functions h_j and g_j have the following properties:
h_j and g_j are continuous functions (assumed for mathematical convenience [11]);
h_j and g_j are nonnegative functions,
for all j=1,2,...,r and all P(t) whose elements are in the open interval (0,1). As mentioned
above, if the functions h_j and g_j are linear in P(t), the automata are said to be linear.
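The general scheme can be sketched generically, with the functions g_j and h_j passed in; the particular linear choices below (g_j(P) = λ_1 p_j and h_j(P) = λ_2/(r−1) − λ_2 p_j, the multi-action linear reward-penalty form) are one illustrative instantiation, not the only possibility.

```python
def vssa_update(p, i, beta, g, h):
    """General continuous VSSA update: on a reward (beta = 0) every
    non-chosen action j loses g_j(P) and the chosen action i gains
    their sum; on a penalty (beta = 1) every non-chosen action gains
    h_j(P) and the chosen action loses their sum, so the updated
    vector always sums to 1."""
    r = len(p)
    q = list(p)
    delta = g if beta == 0 else h
    sign = -1 if beta == 0 else +1
    for j in range(r):
        if j != i:
            q[j] = p[j] + sign * delta(j, p)
    q[i] = p[i] - sign * sum(delta(j, p) for j in range(r) if j != i)
    return q

# Illustrative linear choices: g_j(P) = l1*p_j, h_j(P) = l2/(r-1) - l2*p_j.
l1, l2 = 0.1, 0.05
g = lambda j, p: l1 * p[j]
h = lambda j, p: l2 / (len(p) - 1) - l2 * p[j]

p = vssa_update([0.5, 0.3, 0.2], 0, 0, g, h)   # action alpha_1 rewarded
print(p, sum(p))
```

Because the chosen action absorbs exactly what the others lose (or pay out exactly what the others gain), normalization is preserved by construction.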
VSSA are implemented using a Random-Number Generator (RNG). The
automaton decides on which action to choose based on the action probability distribution.
From the class of linear VSSA, the following three algorithms are relevant
because they express the three main philosophies of learning: the linear reward-penalty
scheme (LRP), the linear reward-inaction scheme (LRI), and the linear inaction-penalty
scheme (LIP). All these schemes are explained below for learning automata with
two actions. Their extension to multiple actions can be found in [11].
2.3.2.1 Linear Reward-Penalty Scheme (LRP)

In a linear reward-penalty scheme, the automaton increases the probability of the
action that has been rewarded, and decreases the probability of the action that has been
penalized. This method of learning gives the following updating equations, when α(t)=α_i:

p_i(t+1) = p_i(t) + λ_1 (1 − p_i(t)), p_j(t+1) = (1 − λ_1) p_j(t), j ≠ i, if β(t) = 0,
p_i(t+1) = (1 − λ_2) p_i(t), p_j(t+1) = λ_2 + (1 − λ_2) p_j(t), j ≠ i, if β(t) = 1, (2.26)

where 0<λ_1<1 and 0<λ_2<1 are the reward and penalty parameters, respectively. These
equations show that whenever a probability p_k(t) is increased, it is increased by a value
proportional to its distance to 1, namely λ[1 − p_k(t)]. When a probability p_k(t) is
decreased, it is decreased by a value proportional to its distance to 0, i.e., λ p_k(t). The
specific case where λ_1 = λ_2 is known as the symmetric linear reward-penalty scheme
(LRP).
The LRP scheme is ergodic. Also, it was shown that the asymptotic value of the
average penalty for the symmetric LRP scheme is given by:

lim_{t→∞} E[M(t)] = 2 c_1 c_2 / (c_1 + c_2) < (c_1 + c_2)/2 = M_0,

where c_1 and c_2 are the penalty probabilities. This proves that the LRP scheme is
expedient for all initial conditions, in all stationary environments. Since the scheme is
ergodic, it is suitable for non-stationary environments [11].
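A sketch of the two-action LRP update, together with the expediency inequality above, using illustrative parameters (the helper name is ours):

```python
def lrp_update(p1, action, beta, l1, l2):
    """Two-action LRP update applied to p1 = Pr(choosing alpha_1);
    p2 is implicitly 1 - p1.  l1 and l2 are the reward and penalty
    parameters, respectively."""
    rewarded = (beta == 0)
    step = l1 if rewarded else l2
    if (action == 1) == rewarded:   # alpha_1 rewarded, or alpha_2 penalized
        return p1 + step * (1 - p1)
    return p1 - step * p1           # alpha_1 penalized, or alpha_2 rewarded

# Expediency of the symmetric scheme: the asymptotic average penalty
# 2*c1*c2/(c1+c2) lies strictly below the pure-chance value (c1+c2)/2.
c1, c2 = 0.2, 0.6
assert 2 * c1 * c2 / (c1 + c2) < (c1 + c2) / 2
```

Note how each branch moves p1 by an amount proportional to its distance to 1 (when increased) or to 0 (when decreased), exactly as the equations require.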
2.3.2.2 Linear Reward-Inaction Scheme (LRI)
The basic idea of the reward-inaction scheme (LRI) is to keep the probabilities
unchanged whenever the environment replies with an unfavorable response. When a
favorable response is given, the probability of the action is increased as in the LRP
scheme. The updating equations for this scheme can be derived from the LRP scheme by
choosing the penalty parameter λ_2 to be 0, and are presented as follows, when α(t)=α_i:

p_i(t+1) = p_i(t) + λ_1 (1 − p_i(t)), p_j(t+1) = (1 − λ_1) p_j(t), j ≠ i, if β(t) = 0,
P(t+1) = P(t), if β(t) = 1.

These equations indicate that this scheme has two absorbing states: [0,1]^T and
[1,0]^T. For example, if p_1(t) becomes unity and action α_1 is rewarded, the probability
becomes

p_1(t+1) = p_1(t) + λ_1 (1 − p_1(t)) = 1.

If the automaton is penalized at this time, then the probability remains unchanged since
the automaton does not react to penalties, which implies that the state [1,0]^T is an
absorbing state. In an analogous manner, it can be shown that [0,1]^T is an absorbing
state. This makes the scheme inappropriate for non-stationary environments. In any
stationary environment, the LRI scheme has proved to be ε-optimal [7].
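The absorbing-state argument can be checked mechanically on a two-action LRI sketch (the helper name is ours):

```python
def lri_update(p1, action, beta, l):
    """Two-action LRI update on p1 = Pr(choosing alpha_1): the vector
    changes only on a reward (beta = 0); l is the reward parameter."""
    if beta == 1:                   # inaction on penalty
        return p1
    if action == 1:                 # alpha_1 rewarded
        return p1 + l * (1 - p1)
    return p1 - l * p1              # alpha_2 rewarded

# Once p1 reaches 0 or 1, no response can move it: both states absorb.
for beta in (0, 1):
    assert lri_update(1.0, 1, beta, 0.1) == 1.0
    assert lri_update(0.0, 2, beta, 0.1) == 0.0
```

The absorbing behaviour that makes the scheme ε-optimal in stationary environments is the same property that makes it unable to track a nonstationary one.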
2.3.2.3 Linear Inaction-Penalty Scheme (LIP)
This scheme is based on the principle that the probabilities are updated only when
an action is being penalized, and they remain unchanged when an action is rewarded.
This method of learning can be expressed mathematically as follows, when α(t)=α_i:

p_i(t+1) = p_i(t), p_j(t+1) = p_j(t), j ≠ i, if β(t) = 0,
p_i(t+1) = (1 − λ_2) p_i(t), p_j(t+1) = λ_2 + (1 − λ_2) p_j(t), j ≠ i, if β(t) = 1.

It has been proved that this automaton is ergodic and expedient [7].
All these schemes can be obtained from the equations (2.26) by giving different
values to the learning parameters λ_1 and λ_2; for example, the symmetric LRP scheme can be obtained
from these equations for λ_1 = λ_2, and the LRI scheme can be obtained for λ_2 = 0.
Lakshmivarahan and Thathachar [11] [7] studied the general behavior of a linear reward-
penalty scheme for different parameters λ_1 and λ_2. They have shown that a scheme based
on the equations (2.26) with λ_2 ∈ (0,1] does not have any absorbing states, and the nature
of convergence of {P(t)}_{t≥0} is similar to that of the LRP scheme. Furthermore, they have
shown that for small values of the parameter λ_2 relative to λ_1, the equations (2.26) can
generate an ε-optimal scheme [11]. This led to an LRεP scheme which is ergodic and ε-
optimal [11]. This scheme was obtained by adding a small penalty term to the LRI
scheme; i.e. it can be viewed as an LRP scheme where the penalty terms are made small in
comparison with the reward terms. The importance of this scheme is that it has a marked
ability to be used in non-stationary environments and yet has good convergence
properties.
Another method used to improve the convergence of VSSA is to discretize the
probability space. The next section describes this method and presents a few examples of
discretized learning automata.
2.3.3 Discretized Learning Automata
Prior to the introduction of the concept of discretization, all the existing continuous
VSSA permitted action probabilities to take any value in the interval [0,1]. In their
implementation, the learning automata use a Random-Number Generator (RNG) in
determining which action to choose. In theory, an action probability can take any value
between 0 and 1, so the RNG is required to be very accurate; however, in practice, the
probabilities are rounded off to an accuracy depending on the architecture of the machine
that is used to implement the automaton.
In order to increase the speed of convergence of these algorithms and to minimize
the requirements on the RNG, the concept of discretizing the probability space was
introduced [26] [19]. Analogous to the continuous algorithms, the discretized VSSA can
be defined using probability updating functions, but these functions can take values only in a
discrete finite space. These values divide the continuous [0,1] interval into a finite
number of subintervals. The discrete algorithms are said to be linear if these
subintervals have equal length; otherwise they are called nonlinear [26].
Like the continuous learning automata, the discretized learning automata can be
analyzed using the theory of Markov chains, and can be divided into two categories:
ergodic or absorbing.
Following the discretization concept, many of the continuous variable structure
stochastic automata have been discretized. Various discrete automata have been
presented in the literature [19] [17] [14] [26]. All of the linear VSSA presented in the
previous section have corresponding discretized versions, i.e. the discretized linear
reward-penalty automaton (DLRP), the discretized linear reward-inaction automaton
(DLRI), and the discretized linear inaction-penalty automaton (DLIP) [26] [14] [17]. The
concepts of discretized automata will be presented in the following sections by
demonstrating the similarities and dissimilarities between some continuous automata and
their discrete counterparts.
2.3.3.1 Discretized Linear Reward-Inaction Automaton
The discretized linear reward-inaction automaton (DLRI) was the first discretized
automaton presented in the literature [26]. In the following description of this automaton,
only two actions are considered, but the same concept applies to an r-action (2 ≤ r < ∞)
automaton. The basic idea of the learning algorithm is to make discrete changes in the
action probabilities. The probability space [0,1] is divided into N intervals, where N is a
resolution parameter and is recommended to be an even integer. Since it is a reward-
inaction automaton, the updating equations do not modify the action probability vector
when the environment penalizes the automaton. When the response from the
environment is a reward, the automaton increases the probability of the action that has
been chosen and decreases the probability of all the remaining actions.
The discretized automaton has a state associated with every possible probability
value, which determines the following set of states: Q={q_0, q_1, ..., q_N}. In every state q_i, the
probability that the automaton chooses action α_1 is i/N and the probability to choose
action α_2 is (1 − i/N). The state transition map is defined by the following equations:

q(t+1) = q_{i+1}, if α(t) = α_1 and β(t) = 0,
q(t+1) = q_{i−1}, if α(t) = α_2 and β(t) = 0,
q(t+1) = q_i, if α(t) = α_1 or α_2 and β(t) = 1,

where q(t) = q_i, q_i ≠ q_0 or q_N. It can be seen that both q_0 and q_N are absorbing states. Based on
the probabilities associated with each state, the automaton can be described entirely by
the following action probability updating equations:

p_1(t+1) = min{1, p_1(t) + 1/N}, if α(t) = α_1 and β(t) = 0,
p_1(t+1) = max{0, p_1(t) − 1/N}, if α(t) = α_2 and β(t) = 0,
P(t+1) = P(t), if β(t) = 1.
The algorithm starts with the initial action probability vector P(0) = [1/2, 1/2]^T and a
resolution parameter N.
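Because every probability is a multiple of 1/N, the DLRI update can be carried out exactly in integer arithmetic; in this two-action sketch, k (counting units of 1/N held by action α_1) is our encoding.

```python
def dlri_update(k, action, beta, N):
    """Two-action DLRI update where p1 = k/N.  On a reward the chosen
    action gains one unit of 1/N (clamped to [0, N]) and the other
    action loses it; penalties leave the vector unchanged."""
    if beta == 1:                   # inaction on penalty
        return k
    if action == 1:                 # alpha_1 rewarded
        return min(N, k + 1)
    return max(0, k - 1)            # alpha_2 rewarded

N = 10
k = N // 2                          # P(0) = [1/2, 1/2]
for action, beta in [(1, 0), (1, 1), (2, 0), (1, 0), (1, 0)]:
    k = dlri_update(k, action, beta, N)
print(k / N)                        # current p1
```

Since k only ever changes by ±1 and is clamped at 0 and N, the chain's absorbing states k=0 and k=N correspond exactly to [0,1]^T and [1,0]^T, and no rounding error can accumulate.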
These equations indicate that {P(t)} behaves like a homogeneous Markov chain
with two absorbing states: [1,0]^T and [0,1]^T. The algorithm has been proven to be ε-
optimal in all environments [19]. The difference between this algorithm and its
continuous version is in the rate of convergence. Oommen and Hansen have performed
simulations of the LRI and DLRI automata, and in all scenarios the DLRI automaton is
superior to the LRI automaton [19]. Their studies indicate that when the two automata
were made to learn the best action in an environment with c_1=0.2 and c_2=0.6, in 240
iterations the LRI automaton gave only an expected value of 0.99982, whereas the DLRI scheme
gave an expected value of 0.99999 and subsequently the value stayed at unity. If a
stopping criterion was used, it was seen that the DLRI automaton reached 0.99 accuracy in
125 iterations and the LRI automaton reached the same accuracy in 135 iterations.
In [14], Oommen compared the DLRI with some deterministic automata. Table
2.1 shows a comparison between the performance of various learning automata.
Table 2.1: Experimental comparative performance of DLRI with other FSSA [14]

From these results, Oommen concluded that for environments with c_1 > 0.5, the
DLRI is more accurate than the Tsetlin automaton. Furthermore, for a fixed N, as the
difference between the penalty probabilities is decreased, the DLRI becomes more accurate
than the Tsetlin and Krinsky automata. In all these environments, Oommen observed that
the DLRI automaton is faster than the Tsetlin and Krinsky automata. It was later shown
that the DLRI scheme is ε-optimal in all random environments [14].
2.3.3.2 Discretized Linear Inaction-Penalty Automaton
Following the same method of discretization used for the reward-inaction
algorithm, a discretized version of the linear inaction-penalty algorithm, denoted DLIP,
was developed [14]. This algorithm has been proved ergodic and expedient in all random
environments [14]. A later, artificially created absorbing version of this algorithm, the
absorbing discretized linear inaction-penalty automaton, denoted ADLIP, was the first
inaction-penalty algorithm proved to be ε-optimal [14]. Although the ADLIP automaton
is ε-optimal, simulation results have shown that this scheme is very accurate but slow in
convergence [14]. When the penalty probabilities are high, the automaton utilizes many
more responses of the environment than a reward-inaction automaton. Table 2.2 presents
some comparative results between the ADLIP, the DLRI and the ADLRP automata.

The updating rules for the DLIP algorithm are defined in the following equations:

P(t+1) = P(t), if α(t) = α_1 or α_2 and β(t) = 0,
p_1(t+1) = max{0, p_1(t) − 1/N}, if α(t) = α_1 and β(t) = 1,
p_1(t+1) = min{1, p_1(t) + 1/N}, if α(t) = α_2 and β(t) = 1.
2.3.3.3 Discretized Linear Reward-Penalty Automaton
The discretized linear reward-penalty automaton (DLRP), like its continuous
version, reacts to both the reward and penalty responses of the environment. Similarly to the
DLRI and the DLIP automata, the DLRP updates its action probabilities in steps of size 1/N,
where N is a resolution parameter. The updating rules for this automaton are given by
the following equations:

p_1(t+1) = min{1, p_1(t) + 1/N}, if [α(t) = α_1 and β(t) = 0] or [α(t) = α_2 and β(t) = 1],
p_1(t+1) = max{0, p_1(t) − 1/N}, if [α(t) = α_2 and β(t) = 0] or [α(t) = α_1 and β(t) = 1].
Oommen and Christensen [17] proved that the DLRP automaton is ergodic and ε-
optimal in all random environments in which c_min < 0.5. They also showed that by
making a stochastic modification to the transition function, the automaton can be made
ergodic and ε-optimal in all random environments. This modified version of the DLRP
automaton is known as the modified discrete linear reward-penalty automaton, MDLRP,
and is the only known ergodic linear reward-penalty scheme which is ε-optimal in all
random environments. Oommen and Christensen also created an absorbing version of the
DLRP automaton, denoted ADLRP. They showed that a discretized two-action linear
reward-penalty automaton with artificially created absorbing barriers is ε-optimal in all
random environments. It is the only symmetric ε-optimal learning automaton known.
Simulation results indicated that the ADLRP scheme is extremely accurate and fast in
convergence [17].
Oommen and Christensen have also made a comparative study of the performance
and accuracy of some of the discrete linear automata [17]. The results of this study,
presented in Table 2.2, show that the ADLRP scheme is superior on counts of both
speed and accuracy.
Table 2.2: Comparative performance of the DLRI, ADLIP and ADLRP automata (c2 = 0.8)
These results indicate, for example, that if N = 10, c1 = 0.6 and c2 = 0.8, the DLRI
scheme converges with an expected accuracy of 0.855, and the mean time to convergence
(M.T.C.) was 25.58 iterations. With the same parameters, the ADLIP scheme converged
with a greater accuracy (0.93), but the mean time to convergence was much larger: 499.11
iterations. The results for the ADLRP show that it converged with an accuracy of
0.93, and the mean time to convergence was 32.45 iterations. The following table
summarizes the convergence characteristics of all the VSSA, in their continuous and
discrete forms.
T W e 2.3: Compatison between continuous a d discrete heur VSSA
continuous
discrete
continuous
discrete
discrete
con tinuous
discrete
discrete
discrete
Matkov Chain
Characterization
absorbing
absorbing
ergodic
ergodic
absorbing
ergodic
ergodic
absorbing
ergodic
Convergence Behavior
- ---
&-optimal in al1 environments
&-optimal in al1 environments
expedien t
expedient
&-optimal in dl environments
expedient in al1 stationiiry env.
&-optimal if ~ ~ ~ 4 . 5
&-optimal in al1 environments
&-optimal in dl environments
Although in this section we presented only linear schemes, it is important to note
that discrete nonlinear schemes have also been developed [14]. A description of these is
omitted here in the interest of brevity. The next section presents a new category of
learning algorithms, the estimator algorithms.
Chapter 2: LEARNING AUTOMATA - AN OVERVIEW
2.4.1 Overview
In the quest to design faster converging learning algorithms, Thathachar and
Sastry opened another path by introducing a new class of algorithms, called
estimator algorithms [28]. The main feature of these algorithms is that they maintain
running estimates of the reward probability of each possible action, and use them in the
probability updating equations. The purpose of these estimates is to crystallize the
confidence in the reward capabilities of each action. In their characteristics, these
algorithms model the behavior of a person who is trying to choose an action in a random
environment. In this task, the most common and simple approach is to try each action a
number of times and to estimate the probability of reward for each action. The person
will most likely choose the action that has the highest reward estimate; however, unlike
straightforward estimation, the superior actions are also more likely to be chosen
during the estimation process itself.
From this perspective, all the algorithms presented in the previous sections are non-
estimator algorithms. The main difference between the estimator algorithms and the non-
estimator algorithms lies in the way the action probability vector is updated. The non-
estimator algorithms update the probability vector based directly on the response of the
environment. If the chosen action is rewarded, then the automaton increases the
probability of choosing this action at the next time instant. Otherwise, the action
probability of the selected action is decreased.
The estimator algorithms are characterized by the use of the estimates for each
action. The change of the probability of choosing an action is based on its current
estimated mean reward, and possibly on the feedback of the environment. The
environment determines the probability vector indirectly, through the calculation of the
reward estimates for each action. Even when the chosen action is rewarded, there is a
possibility that the probability of choosing another action is increased.
For the definition of an estimator learning automaton, a vector of reward
estimates d̂(t) must be introduced. Hence, the state vector Q(t) is defined as

Q(t) = <P(t), d̂(t)>, where d̂(t) = [d̂1(t), ..., d̂r(t)]^T [29].
Thathachar and Sastry have shown that the estimator algorithms exhibit a superior
speed of convergence when compared with the non-estimator algorithms [29]. In 1989,
Oommen and Lanctôt introduced discretized versions of the estimator algorithms, and
have shown that the discretized estimator algorithms are even faster than their continuous
counterparts [10].
This section describes the class of continuous estimator algorithms.
Thathachar and Sastry [27] introduced the concept of estimator algorithms by
presenting a Pursuit Algorithm that implemented a reward-penalty learning philosophy,
denoted CPRP. As its name reveals, this algorithm is characterized by the fact that it
pursues the action that is currently estimated to be the optimal action. The algorithm
achieves this by increasing the probability of the current optimal action, whether the chosen
action was rewarded or penalized by the environment.

The CPRP algorithm involves three steps [27]. The first step consists of choosing
an action α(t) based on the probability distribution P(t). Whether the automaton is
rewarded or penalized, the second step is to increase the component of P(t) whose reward
estimate is maximal (the current optimal action), and to decrease the probabilities of all
the other actions. The probability of the current optimal action is increased by an amount
directly proportional to its distance from unity, namely 1 − pm(t); all the other
probabilities are decreased proportionally to their distance from zero, i.e. pi(t). The last
step is to update the running estimates of the probability of being rewarded. For
calculating the estimates, two more vectors are introduced: W(t) and Z(t), where Zi(t) is the
number of times the i-th action has been chosen and Wi(t) is the number of times the
action has been rewarded. The estimate vector d̂(t) can then be calculated using the
following formula:

d̂i(t) = Wi(t) / Zi(t), for i = 1, 2, ..., r.
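These counts make d̂i(t) simply the fraction of rewarded choices, i.e. the maximum-likelihood estimate of the reward probability of action i. A minimal sketch of the bookkeeping (the helper name is ours, not the thesis's):

```python
def update_estimate(W, Z, i, beta):
    """Record the response beta (0 = reward, 1 = penalty) for action i
    and return the refreshed estimate d_i(t) = W_i(t) / Z_i(t)."""
    W[i] += 1 - beta   # count one more reward iff beta == 0
    Z[i] += 1          # action i was chosen once more
    return W[i] / Z[i]
```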
Since the rest of the thesis will deal with Pursuit and estimator algorithms, we
formally present the algorithm below.

ALGORITHM CPRP
Parameters:
  λ      the speed of learning parameter, where 0 < λ < 1
  m      index of the maximal component of d̂(t), i.e. d̂m(t) = max_{i=1..r} {d̂i(t)}
  Wi(t)  the number of times the i-th action has been rewarded up to time t, with 1 ≤ i ≤ r
  Zi(t)  the number of times the i-th action has been chosen up to time t, with 1 ≤ i ≤ r
Method:
  Initialization: pi(t) = 1/r, for 1 ≤ i ≤ r
  Initialize d̂(t) by picking each action a small number of times.
  Repeat
    Step 1: At time t pick α(t) according to the probability distribution P(t). Let α(t) = αi.
    Step 2: If αm is the current optimal action, update P(t) according to the following
            equation, where em denotes the unit r-vector with 1 in position m:
              P(t+1) = (1 − λ)·P(t) + λ·em
    Step 3: Update d̂(t) according to the following equations:
              Wi(t+1) = Wi(t) + (1 − β(t))
              Zi(t+1) = Zi(t) + 1
  End Repeat
END ALGORITHM CPRP
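The three steps of the CPRP can be sketched as follows. The environment model `env` (a list of reward probabilities) and the surrounding sampling harness are illustrative conveniences of ours, not part of the algorithm as stated in [27].

```python
import random

def cprp(env, r, lam=0.01, horizon=5000, seed=0):
    """Sketch of the continuous pursuit reward-penalty (CPRP) automaton."""
    rng = random.Random(seed)
    p = [1.0 / r] * r              # action probability vector P(t)
    W = [0] * r                    # reward counts
    Z = [0] * r                    # selection counts
    for i in range(r):             # initialize estimates: try each action once
        W[i] += rng.random() < env[i]
        Z[i] += 1
    for _ in range(horizon):
        # Step 1: choose an action according to P(t)
        i = rng.choices(range(r), weights=p)[0]
        beta = 0 if rng.random() < env[i] else 1   # 0 = reward, 1 = penalty
        # Step 3 (bookkeeping first): update the running reward estimates
        W[i] += 1 - beta
        Z[i] += 1
        # Step 2: pursue the best current estimate, on reward AND on penalty
        d = [W[k] / Z[k] for k in range(r)]
        m = max(range(r), key=lambda k: d[k])
        p = [(1.0 - lam) * pk for pk in p]
        p[m] += lam                # P(t+1) = (1 - lam) P(t) + lam e_m
    return p
```

In a deterministic environment such as env = [1.0, 0.0], the vector converges to the first action.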
The CPRP algorithm is similar in design to the LRP algorithm, in the sense that
both algorithms modify the action probability vector P(t) whether the response from the
environment is a reward or a penalty. The difference occurs in the way they approach the
solution. The LRP algorithm moves P(t) in the direction of the most recently rewarded
action or in the direction of all the actions not penalized, whereas the CPRP algorithm
moves P(t) in the direction of the action which has the highest reward estimate.
Thathachar and Sastry proved that this algorithm is ε-optimal in every stationary
environment. Also, comparing the performance of the CPRP and LRI automata, the
authors have shown that the CPRP algorithm converges up to seven times faster than the
LRI automaton [27].
2.4.2.2 TSE Algorithm
Thathachar and Sastry in [28] introduced a more sophisticated estimator
algorithm, which we refer to as the TSE Algorithm. Being an estimator algorithm, it
considers the reward estimates in calculating the action probability vector. The algorithm
increases the probabilities of all the actions that have a higher estimate than the estimate
of the chosen action, and decreases the probabilities of all the actions with a smaller
estimate. The probabilities are updated based on both the reward estimates d̂(t) and the
action probability vector P(t), as shown in the detailed description of this algorithm below.
ALGORITHM TSE
Parameters:
  λ      the speed of learning parameter, where 0 < λ < 1
  m      index of the maximal component of d̂(t), i.e. d̂m(t) = max_{i=1..r} {d̂i(t)}
  Wi(t)  the number of times the i-th action has been rewarded up to time t, with 1 ≤ i ≤ r
  Zi(t)  the number of times the i-th action has been chosen up to time t, with 1 ≤ i ≤ r
  Sij(t) an indicator function: Sij(t) = 1 if d̂i(t) > d̂j(t), and Sij(t) = 0 if d̂i(t) ≤ d̂j(t)
  f      a monotonic, increasing function f: [−1, 1] → [−1, 1] satisfying f(0) = 0
Method:
  Initialization: pi(t) = 1/r, for 1 ≤ i ≤ r
  Initialize d̂(t) by picking each action a small number of times.
  Repeat
    Step 1: At time t pick α(t) according to the probability distribution P(t). Let α(t) = αi.
    Step 2: Update P(t) based on the estimates d̂(t), the indicator functions Sij(t) and the
            function f, as discussed below.
    Step 3: Update d̂(t) according to the following:
              Wi(t+1) = Wi(t) + (1 − β(t))
              Zi(t+1) = Zi(t) + 1
  End Repeat
END ALGORITHM TSE
It is important to notice that P(t) depends only indirectly on the response of the
environment. The feedback from the environment changes the values of the reward
estimate vector, which in turn affects the values of the functions f and Sij.
The detailed description of the algorithm indicates that if the chosen action αi is
rewarded, all the probabilities pj(t) that correspond to actions with reward estimates
higher than the reward estimate d̂i(t) are increased. Since d̂i(t) < d̂j(t), the sign of the
function f(d̂i(t) − d̂j(t)) is negative and, thus, the probability pj(t) increases
proportionally to (1 − pj(t)). For all the actions with reward estimates smaller than
d̂i(t), the sign of the function f(d̂i(t) − d̂j(t)) is positive, which means that the action
probability pj(t) is decreased proportionally to pj(t).

The probability pi(t) of the chosen action is increased or decreased so as to ensure
that the sum of all the action probabilities is 1. When all the reward estimates are higher
than the reward estimate of action αi, the automaton increases all the probabilities
pj(t+1). To ensure that the amount by which these probabilities increase does not surpass
the value of pi(t), Thathachar and Sastry introduced the term pi(t)/(r−1) in the updating
equations.
There are two main differences between the Pursuit Algorithm and the TSE
Algorithm. First, they differ in the method of deciding which action probabilities are
increased and which are decreased. The Pursuit algorithm increases only the probability
of the action corresponding to the highest estimate, whereas the TSE Algorithm increases
the probabilities of all the actions with a higher reward estimate than the estimate of the
chosen action. Second, their updating equations differ. In increasing or decreasing a
probability, the TSE Algorithm also considers the distance between estimates,
incorporated in the term f(d̂i(t) − d̂j(t)), whereas the Pursuit Algorithm takes into
account only the distance between the probability at time t and the probability that it aims
for each action, 0 or 1.
This algorithm has been shown to be ε-optimal [29]. Also, Thathachar and Sastry
presented simulation results comparing it with the LRI scheme. They have shown that,
for the same level of accuracy, the TSE Algorithm often converges at least seven times
faster than the LRI scheme.
The Discrete Estimator Algorithms (DEA) were introduced as an approach to
create even faster converging learning algorithms [9] [10]. They emerged from applying
the discretization "philosophy" to the existing estimator algorithms. In this way, the
action probabilities are allowed to take values in a finite set, and the updating rules use
the reward estimates.

Lanctôt and Oommen have defined a set of properties that every discrete
estimator algorithm must possess [9] [10]. These properties are known as the Property of
Moderation and the Monotone Property.
Property 1: A DEA with r actions and a resolution parameter N is said to possess the
property of moderation if the maximum magnitude by which an action probability can
decrease per iteration is bounded by 1/(rN).
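The moderation bound lends itself to a mechanical check of any candidate updating rule. The helper below (names ours) simply compares the largest per-component decrease of a transition P(t) → P(t+1) against the bound 1/(rN):

```python
def respects_moderation(p_old, p_new, N):
    """True if the transition P(t) -> P(t+1) obeys the moderation bound:
    no component may decrease by more than 1/(r*N) in one iteration."""
    r = len(p_old)
    bound = 1.0 / (r * N)
    # largest per-component drop, with a small tolerance for float error
    return max(a - b for a, b in zip(p_old, p_new)) <= bound + 1e-12
```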
The monotone property can be stated as follows:

Property 2: Suppose there exists an index m and a time instant t0 < ∞, such that
d̂m(t) > d̂j(t) for all j ≠ m and all t ≥ t0. A DEA is said to possess the Monotone Property if
there exists an integer N0 such that for all resolution parameters N > N0, pm(t) → 1 with
probability one as t → ∞.
This means that if the estimate of reward of an action αm remains the maximum
estimate after a certain point in time, then a DEA has the monotone property if it steadily
increases the probability of choosing αm towards unity.
These properties are instrumental in proving that a discretized estimator algorithm is
ε-optimal. Lanctôt and Oommen have proved that any discretized estimator algorithm
possessing both of these properties is ε-optimal [9] [10].
The discrete versions of the Pursuit Algorithm and the TSE Algorithm are
presented in the next sections of this chapter.
In 1989, Lanctôt and Oommen introduced a discretized version of the Pursuit
Algorithm [9] based on the reward-inaction learning "philosophy", denoted DPRI. The
differences between the discrete and continuous versions of the Pursuit algorithms occur
only in the updating rules for the action probabilities. The discrete Pursuit algorithm
makes changes to the probability vector P(t) in discrete steps, whereas the continuous
version uses a continuous function to update P(t). Being a reward-inaction algorithm, the
action probability vector P(t) is updated only when the currently chosen action is
rewarded. If the current action is penalized, the action probability vector P(t) remains
unchanged; in other words, the algorithm uses the estimates in updating the action
probability vector P(t) only if the environment rewards the chosen action. When the
chosen action is rewarded, the algorithm decreases the probabilities of all the actions that
do not correspond to the highest estimate by a step Δ, where Δ = 1/(rN). In order to keep
the sum of the components of the vector P(t) equal to unity, the DPRI increases the
probability of the action with the highest estimate by an integral multiple of the smallest
step size Δ. A description of the algorithm is given below:
ALGORITHM DPRI
Parameters:
  m      index of the maximal component of d̂(t), i.e. d̂m(t) = max_{i=1..r} {d̂i(t)}
  Wi(t)  the number of times the i-th action has been rewarded up to time t, with 1 ≤ i ≤ r
  Zi(t)  the number of times the i-th action has been chosen up to time t, with 1 ≤ i ≤ r
  N      resolution parameter
  Δ      Δ = 1/(rN) is the smallest step size
Method:
  Initialization: pi(t) = 1/r, for 1 ≤ i ≤ r
  Initialize d̂(t) by picking each action a small number of times.
  Repeat
    Step 1: At time t pick α(t) according to the probability distribution P(t). Let α(t) = αi.
    Step 2: Update P(t) according to the following equations:
            If β(t) = 0 and pm(t) ≠ 1 Then
              pj(t+1) = max(pj(t) − Δ, 0), for all j ≠ m
              pm(t+1) = 1 − Σ_{j≠m} pj(t+1)
            Else
              pj(t+1) = pj(t), for all 1 ≤ j ≤ r
    Step 3: Update the reward estimate vector d̂(t) (same as in the CPRP)
  End Repeat
END ALGORITHM DPRI
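Step 2 of the DPRI can be sketched in isolation as follows; the index m of the best estimate is assumed to be maintained by Step 3 and supplied by the caller.

```python
def dpri_step(p, m, beta, N):
    """One DPRI update. On reward (beta == 0) each non-best component
    drops by Delta = 1/(r*N), floored at 0, and the best component m
    absorbs the freed mass (an integral multiple of Delta); on penalty
    the vector is left untouched (inaction)."""
    r = len(p)
    if beta != 0 or p[m] == 1.0:
        return list(p)             # penalty, or already converged
    delta = 1.0 / (r * N)
    q = [max(pj - delta, 0.0) for pj in p]
    q[m] = 1.0 - sum(q[j] for j in range(r) if j != m)
    return q
```

Repeated rewarded steps drive the vector to the unit vector e_m, as the convergence results below require.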
Oommen and Lanctôt proved that this algorithm satisfies both the properties of
moderation and monotonicity [18]. Also, they have shown that the algorithm is ε-
optimal in every stationary random environment [18].

Oommen and Lanctôt performed simulations of the DPRI in some benchmark
environments, and the results were compared against the results of the CPRP
algorithm. The results have shown that in some difficult environments, the DPRI requires
only 50% of the number of iterations required by its continuous version. In a ten-action
environment, the DPRI algorithm required 69% of the iterations required by the CPRP
[18].
2.4.3.3 Discrete TSE Algorithm
Oommen and Lanctôt also presented the discretized version of the TSE
Algorithm, denoted DTSE [10]. As the authors stated, "the design of this algorithm
is merely a compromise between the necessity of having the algorithm possess the
moderation and monotone properties while possessing as many qualities of the
continuous algorithm as possible" [10].

Oommen and Lanctôt justified the transformation of the TSE Algorithm into
a discretized one by analyzing each factor that is part of the updating rules of the TSE
Algorithm. The parameter λ, representing the maximum amount by which a continuous
probability component can change, has been replaced by the integer θ. The term
f(d̂i(t) − d̂j(t)) was preserved in the DTSE Algorithm, representing a factor of the
difference between the reward estimates. The term of the TSE equations that depends on
the action probabilities has been transformed into Sij(t) + Sji(t)/(r−1), thereby
eliminating the continuous dependency on the probability vector, as part of the
discretization process.
The resultant factor θ·f(d̂i(t) − d̂j(t))·(Sij(t) + Sji(t)/(r−1)) determines the
maximum amount by which a probability will be increased or decreased. In order to make
these changes discrete, this factor has to be represented in terms of a number of Δ-
steps that also preserves the probabilities in the interval [0, 1]. To do this, two new
functions have been introduced: Rnd(x) and Check(pi(t), pj(t), x). The Rnd() function
rounds its argument so that its value is always an integer. The Check() function
calculates the largest integer multiple of Δ, no larger than x, that can be added to pi(t) and
subtracted from pj(t), and which simultaneously preserves these probabilities in the
interval [0, 1].

The algorithm first modifies the value of the action probability that has the
highest reward estimate, to guarantee that this value will always increase. A description
of the algorithm is given below.
ALGORITHM Discrete TSE
Parameters:
  m, Wi(t), Zi(t), Sij(t): the same as in the TSE Algorithm
  Δ = 1/(rNθ), with θ (an integer) being the maximum number of steps by which any
      component can change
  Rnd(x) rounds x to one of {−θ, −θ+1, −θ+2, ..., θ−1, θ}
  Check(pi(t), pj(t), x) returns the largest integer w ≤ x such that
      0 ≤ pi(t) + wΔ and pj(t) − wΔ ≤ 1
Method:
  Initialization: pi(t) = 1/r, for 1 ≤ i ≤ r
  Initialize d̂(t) by picking each action a small number of times.
  Repeat
    Step 1: At time t pick α(t) according to the probability distribution P(t). Let α(t) = αi.
    Step 2: Update P(t) according to the following. For each action j, starting with m, Do:
              pj(t+1) = pj(t) − Δ·Check(pi(t), pj(t), change)
              pi(t+1) = pi(t) + Δ·Check(pi(t), pj(t), change)     (2.45)
    Step 3: Same as in the TSE Algorithm.
  End Repeat
END ALGORITHM Discrete TSE
Although the algorithm is a complicated one, Oommen and Lanctôt proved that it
possesses both the moderation and the monotone properties, which makes it ε-optimal
[18].
The authors have also performed simulations in order to study the performance of
the discrete TSE in comparison to its continuous version. Some of the results are
presented in the following table:
Table 2.4: The number of iterations until convergence in two-action environments for the TSE Algorithm [10]

Probability of Reward        Mean Iterations
Action 1      Action 2       Continuous      Discrete
In their simulations, the algorithms were required to reach a standard rate of
accuracy of making no errors in convergence in 100 experiments. The initialization of
the reward estimate vector was done in 20 iterations, which are included in the
results shown in Table 2.4. The results show that the continuous TSE
algorithm was from 4 to 50% slower than the discrete TSE. For example, with d1 = 0.8 and
d2 = 0.775, the TSE algorithm took an average of 8500 iterations to converge, whereas the
DTSE required only 5600.
Some comparisons were also made between all the estimator algorithms, in two
benchmark ten-action environments.

Table 2.5: Comparison of the discrete and continuous estimator algorithms in benchmark ten-action environments [10]

Env.   Algorithm      Continuous      Discrete
EA     Pursuit        1140            799
EA     TSE            310             207

Note: The reward probabilities used are:
EA: 0.7  0.5  0.3  0.2  0.4  0.5  0.4  0.3  0.5  0.2
EB: 0.1  0.45 0.84 0.76 0.2  0.4  0.6  0.7  0.5  0.3
These environments were the same ones used to compare the continuous
estimator algorithms to the LRI scheme. The estimator algorithms sampled all 10 actions,
10 times each, to initialize the estimate vector. These 100 extra iterations are included in
the results presented in Table 2.5. As the simulations of the two-action estimator
algorithms showed, the continuous version of the TSE algorithm is slower than the
discrete version; for example, in the environment referred to as EA, the DTSE takes 207
iterations to reach the end-state, whereas the TSE takes 310.

These results also show that the TSE algorithm is faster than the Pursuit
Algorithm. In the same environment, the continuous Pursuit Algorithm required 1140
iterations to converge, and the TSE algorithm required only 310. The same observation
applies to their discrete versions.
2.5 Conclusions
In this chapter, different versions of known learning automata were introduced
and presented in a comparative fashion, following their historical development. The
main concepts were introduced at the beginning, followed by the description of the Fixed
Structure Stochastic Automata. This class of automata was exemplified by presenting
Tsetlin's, Krinsky's and Krylov's automata. The concepts of the Variable Structure
Stochastic Automata were described afterward, and different ergodic and absorbing
versions of such automata were presented, such as the LRI, LRP and LIP schemes. Attention
was focused on the probability of converging to the optimal action in different environments,
and on the performance of these algorithms.

The discretization process was presented as a subsequent step in the evolution of
the learning automata. By discretizing the probability space, and by approaching the
optimal solution in discrete steps instead of using continuous functions, the convergence
of the discrete learning automata improved considerably. Discrete versions of existing
VSSA were presented, such as the DLRI, DLRP and DLIP schemes. The main differences
between these discrete algorithms and their continuous versions were explained.

The concepts of the Estimator Algorithms were introduced and exemplified by
presenting the Pursuit and the TSE algorithms, with their continuous and discrete
versions. Existing experimental results regarding the convergence of these algorithms
were presented, and a comparative study was performed based on these results.

This chapter establishes the foundation for the new algorithms and results that
will be introduced in the following chapters.
Chapter 3: NEW PURSUIT ALGORITHMS¹
3.1 Introduction

The estimator algorithms that update the action probability vector P(t) based
solely on the running estimates consider only the long-term properties of the
Environment; no consideration is given to a short-term perspective. In contrast, the
VSSA and the FSSA rely only on the short-term properties of the Environment (its most
recent responses) for updating the probability vector P(t).

Besides these methods of incorporating the acquired knowledge into the
probability vector, another important characteristic of the learning automata is
the philosophy of the learning paradigm. For example, by giving more
importance to rewards than to penalties in the probability updating rules, the learning
automata can considerably improve their convergence properties. In the case of the linear
schemes of the VSSA, by updating the probability vector P(t) only if the environment
rewarded the chosen action, the linear scheme LRI became ε-optimal, whereas the
¹ The work presented in this chapter has been published in "A Comparison of Continuous and Discretized Pursuit Learning Schemes", authored by B. J. Oommen and M. Agache.
symmetric linear Reward-Penalty scheme, LRP, is at most expedient. Also, considerably
increasing the magnitude of the probability changes made on reward, in comparison to the
changes made on penalty, yields a resultant linear scheme, the LR-εP, which is ε-optimal.
The same behavior can be observed in the case of the FSSA. The difference between the
Krinsky automaton and the Tsetlin automaton is that the Krinsky automaton gives more
importance to the rewards than to the penalties. This modification improves the
performance of the Krinsky automaton, making it ε-optimal in all stationary
environments, whereas the Tsetlin automaton is ε-optimal only in the environments
where min{c1, c2} < 0.5 [31].
In this thesis, we argue that the automaton can model the long-term behavior of
the Environment by maintaining running estimates of the reward probabilities.
Additionally, we contend that the short-term perspective of the Environment is also
valuable, and we maintain that this information resides in the most recent responses that
are obtained by the automaton. We present learning schemes in which both the short-
term and long-term perspectives of the Environment can be incorporated in the learning
process. Specifically, the long-term information is crystallized in the running reward-
probability estimates, and the use of the short-term information is achieved by
considering whether the most recent response was a reward or a penalty. Thus, when
short-term perspectives are considered, the Reward-Inaction and the Reward-Penalty
learning paradigms become pertinent in the context of the estimator algorithms.
The Pursuit algorithm introduced by Thathachar and Sastry [27] considered only
the long-term estimates in the probability updating rules. Later, Oommen and Lanctôt
extended the Pursuit Algorithm into the discretized world by presenting the Discretized
Pursuit Algorithm, which considered both long-term and short-term estimates, and
implemented a reward-inaction learning philosophy [18]. The combination of these
learning "philosophies" and paradigms, in conjunction with the continuous and discrete
computational models, leads to four versions of Pursuit Learning Automata, listed below:

i) Algorithm CPRP: Continuous Pursuit Reward-Penalty Scheme
   Paradigm: Reward-Penalty; Probability Space: Continuous
ii) Algorithm CPRI: Continuous Pursuit Reward-Inaction Scheme
   Paradigm: Reward-Inaction; Probability Space: Continuous
iii) Algorithm DPRP: Discretized Pursuit Reward-Penalty Scheme
   Paradigm: Reward-Penalty; Probability Space: Discretized
iv) Algorithm DPRI: Discretized Pursuit Reward-Inaction Scheme
   Paradigm: Reward-Inaction; Probability Space: Discretized

The CPRP and DPRI algorithms were presented in the previous chapter. This
chapter focuses on presenting the Continuous Reward-Inaction Pursuit Algorithm and the
Discretized Reward-Penalty Pursuit Algorithm. A comparative study of the performance
of all the Pursuit Algorithms is also presented.
3.2 Continuous Reward-Inaction Pursuit Algorithm (CPRI)

The continuous reward-inaction Pursuit Algorithm represents a continuous
version of the Discretized Pursuit Algorithm (DPRI) presented by Oommen and Lanctôt
[18]. It follows the same learning paradigm as the DPRI algorithm, but in a continuous
probability space. Compared with the CPRP, the CPRI algorithm differs only in the
probability updating rules. Being a Reward-Inaction algorithm, it updates the action
probability vector P(t) only if the current action is rewarded by the environment. When
the action is penalized, the action probability vector remains unchanged.

A formal description of the algorithm is given as follows:
ALGORITHM CPRI
Parameters:
  λ, m, em, Wi(t), Zi(t): same as in the CPRP algorithm.
Method:
  Initialization: pi(t) = 1/r, for 1 ≤ i ≤ r
  Initialize d̂(t) by choosing each action a small number of times.
  Repeat
    Step 1: At time t choose α(t) according to the probability distribution P(t). Let α(t) = αi.
    Step 2: If αm is the action with the current highest reward estimate, update P(t) as:
            If β(t) = 0 Then
              P(t+1) = (1 − λ)·P(t) + λ·em     (3.1)
            Else
              P(t+1) = P(t)
    Step 3: Update d̂(t) exactly as in the CPRP Algorithm
  End Repeat
END ALGORITHM CPRI
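The reward-gated pursuit move of Eq. (3.1) can be sketched as a single-step helper; the index m is assumed to be supplied by the caller, as maintained in Step 3.

```python
def cpri_step(p, m, beta, lam):
    """One CPRI update: the pursuit move P(t+1) = (1-lam)*P(t) + lam*e_m
    is applied only on reward (beta == 0); a penalty leaves P(t) as is."""
    if beta != 0:
        return list(p)             # inaction on penalty
    q = [(1.0 - lam) * pj for pj in p]
    q[m] += lam                    # move toward the unit vector e_m
    return q
```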
Similarly to the CPRP, the CPRI algorithm can be proven ε-optimal in any
stationary environment. The proof of the ε-optimality of the CPRI follows the same idea
as for the other Pursuit algorithms. First, it can be shown that, using a sufficiently small
value for the learning parameter λ, all actions are chosen a sufficiently large number of
times, so that d̂m(t) will remain the maximum element of the estimate vector d̂(t) after a
finite time. Mathematically, this can be expressed as follows:

Theorem 3.1: For any given constants δ > 0 and M < ∞, there exist λ* > 0 and t0 < ∞
such that, under the CPRI algorithm, for all λ ∈ (0, λ*),

Pr[All actions are chosen at least M times each before time t] > 1 − δ, for all t ≥ t0.
Proof:
Let us define the random variable Y_i^t as the number of times the i-th action was chosen up
to time t in any specific realization. From the updating equation (Eq. (3.1)), at any step
t of the algorithm, we have

pi(t) ≥ pi(t−1)·(1 − λ),

which implies that during any of the first t iterations of the algorithm

Pr{αi is chosen} ≥ pi(0)·(1 − λ)^t.     (3.3)

With the above clarified, the remainder of the proof is identical to the proof for the CPRP
algorithm [27], and is omitted. ♦ ♦ ♦

It can be shown that if there is an action αm for which the reward estimate
remains maximal after a finite number of iterations, then the m-th component of the action
probability vector converges in probability to 1 (see Theorem 3.2).
Theorem 3.2: Suppose that there exists an index m and a time instant t0 < ∞ such that
d̂m(t) > d̂j(t) for all j ≠ m and all t ≥ t0. Then pm(t) → 1 with probability 1 as t → ∞.
Proof:
To prove this result, we shall demonstrate that the sequence of random variables
{pm(t)}t>t0 is a submartingale. The convergence in probability will then result from the
submartingale convergence theorem [11].

Based on the assumptions of the theorem, pm(t+1) can be expressed as:

pm(t+1) = pm(t) + λ·(1 − pm(t)), if β(t) = 0 (i.e., with probability dm)     (3.4)
pm(t+1) = pm(t), if β(t) = 1 (i.e., with probability 1 − dm),

where dm is defined as 1 − cm, and cm is the penalty probability of the action αm.
Then, the quantity

Δpm(t) = E[pm(t+1) − pm(t) | Q(t)]     (3.5)

becomes

Δpm(t) = dm·λ·(1 − pm(t)) ≥ 0, for all t ≥ t0,     (3.6)

which implies that pm(t) is a submartingale. By the submartingale convergence theorem
[11], {pm(t)} converges as t → ∞, and

E[pm(t+1) − pm(t) | Q(t)] → 0 with probability 1.     (3.7)

Hence, pm(t) → 1 with probability 1, and the theorem is proven. ♦ ♦ ♦
Finally, Theorem 3.3 expresses the ε-optimality of the CPRI algorithm; it can be
easily deduced from the two previous results.

Theorem 3.3: For the CPRI algorithm, in every stationary random environment, there
exist λ* > 0 and t0 < ∞ such that, for all λ ∈ (0, λ*), for any δ ∈ (0, 1) and any ε ∈ (0, 1),

Pr[pm(t) > 1 − ε] > 1 − δ, for all t > t0.
In order to study the performance of this algorithm, simulations and comparisons
against different estimator algorithms were performed. The results are presented in
Section 3.4.
3.3 Discretized Reward-Penalty Pursuit Algorithm (DPRP)

A new Pursuit algorithm, denoted DPRP, is obtained by combining the strategy of
discretizing the probability space with the Reward-Penalty learning paradigm, in the
context of a Pursuit "philosophy". This algorithm uses the estimates in updating the
action probability vector P(t) whether the environment penalizes or rewards the chosen
action. As any discrete algorithm, the DPRP performs changes to the action probability
vector P(t) in a discrete fashion: each component of P(t) can change by a value which
is a multiple of Δ, where Δ = 1/(rN), r is the number of actions and N is a resolution
parameter. Formally, the algorithm can be expressed as follows:
Parameters m. Wi(t), &(t), N and A : Same as in the CPRp algorithm.
Methd Initiake pi(t) = Ur, for 1 S i 5 r Mtidize &t) by choosing each action a small number of times. Repeat
Step 1: At time t choose Nt) according to probability distniution P(t). Let a(t) = a(.
Step 2: Update P(t) according to the foilowing equations: If $(t)=O and p,,,(t)# 1 Then
Cbapter 3 NEW RTRsurr ALCoRITItMs 67
Step 3: Update a(t) exactly as in the CPRp Algonthm End Repeat END ALGORITHM OPRP
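A sketch of one DPRP probability update is given below. This is an illustrative reconstruction, not the thesis's verbatim equations: consistent with the proof of Theorem 3.4 (where p_m(t+1) = p_m(t) + c_t·Δ), every component other than the best-estimated action loses Δ = 1/(rN), floored at zero, and the best-estimated action absorbs the removed mass. The function name is hypothetical.

```python
def dprp_step(p, est, N):
    """One DPRP probability update (sketch): each non-best component loses
    delta = 1/(r*N), floored at 0, and the best-estimated action m takes
    whatever mass keeps P(t+1) a probability vector."""
    r = len(p)
    delta = 1.0 / (r * N)
    m = max(range(r), key=lambda i: est[i])   # best-estimated action
    q = [max(p[j] - delta, 0.0) for j in range(r)]
    q[m] = 1.0 - sum(q[j] for j in range(r) if j != m)
    return q

q = dprp_step([0.25, 0.25, 0.25, 0.25], [0.8, 0.6, 0.5, 0.4], N=5)
print([round(x, 2) for x in q])  # [0.4, 0.2, 0.2, 0.2]
```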
Similarly to the DPRI algorithm, it is possible to show that the DPRP algorithm is ε-optimal in every stationary environment. The proof is very similar to the proof of convergence of the DPRI and it follows the same steps. First, we can prove that if the m-th action is rewarded more than any other action from time t_0 onward, then the action probability vector P(t) for the DPRP will converge to the unit vector e_m. The next theorem captures this result:
Theorem 3.4: If there exists an index m and a time instant t_0 < ∞ such that d̂_m(t) > d̂_j(t) for all j, j ≠ m, and all t ≥ t_0, then there exists an integer N_0 such that for all resolution parameters N > N_0, p_m(t) → 1 with probability 1 as t → ∞.
Proof: The proof of this theorem aims to show that {p_m(t)}_{t≥t_0} is a submartingale satisfying sup E[|p_m(t)|] < ∞. Then, based on the submartingale convergence theorem [11], {p_m(t)}_{t≥t_0} converges, which implies
E[p_m(t+1) - p_m(t) | Q(t)] → 0 as t → ∞,
where c_t is an integer, bounded by 0 and r, such that p_m(t+1) = p_m(t) + c_t·Δ. Thus,
E[p_m(t+1) - p_m(t) | Q(t)] = d_m·c_t·Δ ≥ 0, for all t ≥ t_0, (3.11)
which implies that p_m(t) is a submartingale. From the submartingale convergence theorem, this implies that d_m·c_t·Δ → 0 with probability 1. This in turn implies that c_t → 0 w.p. 1, and consequently, max(p_j(t) - Δ, 0) → 0 w.p. 1 for j ≠ m. Hence p_m(t) → 1 w.p. 1. ♦♦♦
The next step in proving the convergence of this algorithm involves showing that, using a sufficiently large value for the resolution parameter N, all actions are chosen a large enough number of times so that d̂_m(t) will remain the maximum element of the estimate vector d̂(t) after a finite time.
Theorem 3.5: If for each action α_i, p_i(0) ≠ 0, then for any given constants δ > 0 and M < ∞, there exist N_0 < ∞ and t_0 < ∞ such that, under DPRP, for all learning parameters N > N_0 and all time t > t_0:
Pr[each action chosen more than M times at time t] ≥ 1 - δ.
Proof: The proof of this theorem is identical to the proof of the same theorem in the case of the DPRI algorithm, published in [18]. ♦♦♦
These two theorems lead to the conclusion that the DPRP scheme is ε-optimal in all stationary random environments.
3.4 Simulation Results
In the context of this study, simulations were performed in order to compare the rates of convergence of the different Pursuit algorithms in benchmark environments. In all the tests performed, an algorithm was considered to have converged if the probability of choosing an action exceeded a threshold T (0 < T ≤ 1). If the automaton converged to the best action (i.e., the one with the highest probability of being rewarded), it was considered to have converged correctly.
Before comparing the performance of the automata, multiple tests were executed to determine the "best" value of the respective learning parameters for each individual algorithm. A value was reckoned as the "best" value if it yielded the fastest convergence and the automaton converged to the correct action in a sequence of NE experiments. These best parameters were then chosen as the final parameter values used for the respective algorithms to compare their rates of convergence.
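This parameter search can be sketched generically. The helper below is illustrative: the predicate `converges_correctly` stands in for running one full experiment with a given parameter; for continuous schemes one seeks the largest λ, and for discretized schemes the smallest integer N, that never fails.

```python
def best_lambda(converges_correctly, n_experiments=750,
                grid=(0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005)):
    """Return the largest learning parameter for which the automaton
    converges to the correct action in ALL n_experiments runs."""
    for lam in grid:  # ordered from fastest (largest) to slowest learning
        if all(converges_correctly(lam, run) for run in range(n_experiments)):
            return lam
    return None

# Toy stand-in environment: any lambda below 0.1 always converges correctly.
print(best_lambda(lambda lam, run: lam < 0.1))  # 0.05
```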
When the simulations were performed considering the same threshold T and number of experiments, NE, as Oommen and Lanctôt did in [18] (i.e., T=0.99 and NE=75), the learning parameters obtained for the DPRI algorithm in the (0.8 0.6) environment had a variance coefficient¹ of 0.2756665 in 40 tests performed. This variance coefficient was not considered satisfactory for comparing the performance of the Pursuit algorithms. Subsequent simulations were performed imposing stricter convergence requirements by increasing the threshold T, and proportionally, the number of experiments NE, which yielded learning parameters with smaller variance coefficients. For example, the learning parameter (N) for the DPRI algorithm in the (0.8 0.6) environment, when T=0.999 and NE=750, exhibits a variance coefficient of 0.0706893, which represents a much smaller variance. Therefore, in this thesis the simulation results for T=0.999 and NE equal to 500 and 750 experiments shall be presented.
¹ The variance coefficient is defined as σ/M, where σ is the standard deviation and M is the mean.
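The variance coefficient from the footnote can be computed directly; a minimal sketch with illustrative numbers:

```python
from statistics import mean, pstdev

def variance_coefficient(samples):
    """Coefficient of variation: standard deviation divided by the mean."""
    return pstdev(samples) / mean(samples)

# Widely spread "best" parameters give a large coefficient; tightly
# clustered ones give a small coefficient (illustrative values).
print(variance_coefficient([20, 12, 8, 30, 5]))
print(variance_coefficient([10.1, 9.9, 10.0, 10.2, 9.8]))
```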
The simulations were performed in different existing benchmark environments with two and ten actions. These environments have also been used to compare a variety of continuous and discretized schemes, in particular the DPRI in [18], and to compare the performance of the CPRP against other traditional VSSA in [27]. Furthermore, to keep conditions identical, every estimator algorithm sampled all actions 10 times each to initialize the estimate vector. These extra iterations are also included in the results presented in the following tables. Table 3.1 and Table 3.2 contain the simulation results for these four algorithms in two-action environments. The probability of reward for one action was fixed at 0.8 for all simulations and the probability of reward for the second action was increased from 0.5 to 0.7. In each case, the reported results correspond to the results obtained using the above-described "best" parameters.
The reader will observe that there is a considerable difference between the results presented here and the results presented in [29]. In [29], the parameter chosen was the one which gave correct convergence in 25 parallel experiments. However, on testing the CPRP for 100 experiments, it was observed that it yielded only 84% accuracy. Thus, in the case of the CPRP, what we seek is the largest parameter, λ, which yields correct convergence in all the 750 and 500 experiments respectively. Similarly, in the case of the DPRP we seek the smallest integer parameter, N, which yields correct convergence in all the 750 and 500 experiments respectively.
Table 3.1: Comparison of the Pursuit algorithms in two-action benchmark environments for which exact convergence was required in 750 experiments (NE=750).

            DPRI            DPRP            CPRI              CPRP
d1   d2     N     Iterat.   N     Iterat.   λ       Iterat.   λ        Iterat.
0.8  0.5    20    49.07     32    53.74     0.214   55.45     0.122    69.69
0.8  0.6    58    105.64    89    118.24    0.046   198.32    0.027    258.60
0.8  0.7    274   430.20    391   456.13    0.009   939.63    0.0072   942.88

Table 3.2: Comparison of the Pursuit algorithms in two-action benchmark environments for which exact convergence was required in 500 experiments (NE=500).

            DPRI            DPRP            CPRI              CPRP
d1   d2     N     Iterat.   N     Iterat.   λ       Iterat.   λ        Iterat.
0.8  0.5    17    44.75     26    47.9      0.314   43.12     0.169    55.50
0.8  0.6    52    97        74    102.17    0.054   171.85    0.036    199.93
0.8  0.7    217   357.99    297   364.09    0.011   789.29    0.0075   905.36

The results of these simulations suggest that, as the difference in the reward probabilities decreases, i.e., as the environment gets more difficult to learn, the Discretized Pursuit algorithms exhibit a performance superior to the Continuous algorithms. Also, comparing the Pursuit algorithms based on the Reward-Inaction paradigm with the Pursuit algorithms based on the Reward-Penalty paradigm, one can notice that, in general, the Pursuit Reward-Inaction algorithms are up to 20% faster than the Reward-Penalty Pursuit algorithms. For example, if d1=0.8 and d2=0.6, the discretized DPRI converges to the correct action in an average of 105.64 iterations, and
the DPRP algorithm converges in an average of 118.24 iterations. In the same environment, the CPRI algorithm takes an average of 198.32 iterations and the CPRP requires 258.60, indicating that the CPRI is 23% faster than the CPRP algorithm. Figure 3.1 illustrates the performance of these algorithms relative to the performance of the CPRP algorithm.
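The percentage comparisons quoted in this section are relative reductions in the mean number of iterations; for instance, using the figures just cited for the (0.8 0.6) environment:

```python
def pct_faster(iter_fast, iter_slow):
    """Relative saving in mean iterations, as a percentage."""
    return 100.0 * (1.0 - iter_fast / iter_slow)

print(round(pct_faster(105.64, 118.24)))  # DPRI vs DPRP: about 11
print(round(pct_faster(198.32, 258.60)))  # CPRI vs CPRP: about 23
```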
Figure 3.1: Performance of the Pursuit algorithms relative to the CPRP algorithm in the environments (0.8 0.5), (0.8 0.6) and (0.8 0.7), for which exact convergence was required in 750 experiments (NE=750).
The existing two-action benchmark environments used to obtain the results presented in Table 3.1 and Table 3.2 are characterized by having high reward probabilities. To accurately reflect the performance of these algorithms, we have performed the same simulations in three other environments in which the probabilities of reward are small. For this purpose, we fixed the reward probability of the first action to 0.2 and we varied the reward probability of the second action from 0.5 to 0.3. Table 3.3 presents the results obtained in these environments.
Table 3.3: Comparison of the Pursuit algorithms in new two-action environments for which exact convergence was required in 750 experiments (NE=750).

            DPRI            DPRP            CPRI              CPRP
d1   d2     N     Iterat.   N     Iterat.   λ       Iterat.   λ        Iterat.
0.2  0.5    12    51.29     38    60.16     0.349   54.25     0.089    89.25
0.2  0.4    27    108.12    100   129.69    0.109   172.14    0.0273   255.01
0.2  0.3    89    403.59    402   479.69    0.0294  797.78    0.005    1321.11

The results show that also in these environments, the schemes that employed the Reward-Inaction learning paradigm exhibit higher performance than the Reward-Penalty schemes. Furthermore, the discretized schemes prove to be faster than the continuous schemes.
For completeness, similar experiments were performed in the benchmark ten-action environments [18], [20]. Table 3.4 and Table 3.5 present the results obtained in these environments for 750 and 500 experiments.
Table 3.4: Comparison of the Pursuit algorithms in ten-action benchmark environments for which exact convergence was required in 750 experiments (NE=750).
Note: The reward probabilities for the actions are:
EA: 0.7 0.5 0.3 0.2 0.4 0.5 0.4 0.3 0.5 0.2
EB: 0.1 0.45 0.84 0.76 0.2 0.4 0.6 0.7 0.5 0.3

            DPRI             DPRP             CPRI               CPRP
Environ.    N      Iterat.   N      Iterat.   λ        Iterat.   λ         Iterat.
EA          188    752.7     572    1126.8    0.0097   1230.3    0.003     2427.3
EB          1060   2693.7    1655   3230.3    0.002    4603      0.00126   5685

Table 3.5: Comparison of the Pursuit algorithms in ten-action benchmark environments for which exact convergence was required in 500 experiments (NE=500).

            DPRI             DPRP             CPRI               CPRP
Environ.    N      Iterat.   N      Iterat.   λ        Iterat.   λ         Iterat.
EA          153    656.73    377    872.56    0.0128   970.32    0.0049    1544.17
EB          730    2084      1230   2511      0.00225  4126.58   0.00128   5589.54

As in the previous two-action environments, in the ten-action environments the DPRI algorithm proved to have the best performance, converging to the correct action almost 25% faster than the DPRP algorithm, and almost 50% faster than the CPRP algorithm. If we analyze the behavior of these automata in the first environment, EA, when NE=750, the average number of iterations required by the DPRI to converge is 752.7, whereas the DPRP required 1126.8, implying that the DPRI algorithm is 33% faster than the DPRP algorithm. In the same environment, the CPRI requires an average of 1230.3 iterations for
convergence, and the CPRP requires 2427.3, which shows that the CPRI algorithm is 50% faster than the CPRP algorithm, and the DPRI is almost 70% faster than the CPRP.
Based on these experimental results, we can rank the various Pursuit algorithms in terms of their relative efficiencies, that is, the number of iterations required to obtain the same accuracy of convergence. The ranking is as follows:
Best Algorithm: Discretized Pursuit Reward-Inaction (DPRI)
2nd-best Algorithm: Discretized Pursuit Reward-Penalty (DPRP)
3rd-best Algorithm: Continuous Pursuit Reward-Inaction (CPRI)
4th-best Algorithm: Continuous Pursuit Reward-Penalty (CPRP)
Also, the simulation results have shown that the discretized Pursuit algorithms are up to 30% faster than their continuous counterparts. Furthermore, comparing the Reward-Inaction Pursuit algorithms against the Reward-Penalty algorithms, it can be seen that the Reward-Inaction algorithms are superior in the rate of convergence; they are up to 25% faster than their Reward-Penalty counterparts.
3.5 Conclusions
This chapter extends the class of Pursuit Algorithms by introducing two new algorithms resulting from the combination of the Reward-Penalty and Reward-Inaction learning paradigms in conjunction with the continuous and discrete models of computation. The new algorithms introduced are the Continuous Reward-Inaction Pursuit Algorithm and the Discretized Reward-Penalty Pursuit Algorithm. Furthermore, in this chapter we argue that a learning scheme that utilizes the most recent response of the Environment permits the learning algorithm to utilize the long-term and short-term perspectives of the Environment.
This chapter contains a detailed description of these algorithms and the proofs of their convergence. Also, simulations regarding the convergence of these algorithms were performed and a quantitative comparison between the performance of all the Pursuit algorithms was presented.
Overall, the Discretized Pursuit Reward-Inaction algorithm surpasses the performance of all the other versions of Pursuit algorithms. Also, the Reward-Inaction schemes are superior to their Reward-Penalty counterparts.
Chapter 4: GENERALIZATION OF THE PURSUIT ALGORITHM¹
4.1 Introduction
The main idea that characterizes the Pursuit algorithms presented in the previous chapters is that they 'pursue' the best-estimated action, which is the action corresponding to the maximal estimate d̂_m(t). In any iteration, these algorithms increase only the probability of the best action, ensuring that the probability vector P(t) moves towards the solution that has the maximal estimate at the current time. This implies that if, at any time 't', the action that has the maximum estimate is not the action that has the minimum penalty probability, then the automaton pursues a wrong action.
In an attempt to minimize this probability of pursuing a wrong action, our goal in this chapter is to generalize the design of the Pursuit algorithm such that it pursues a set of actions. Specifically, these actions have higher reward estimates than the current chosen action.
¹ The work presented in this chapter is comprised in "Continuous and Discretized Generalized Pursuit Learning Schemes", authored by M. Agache and J.B. Oommen, submitted for publication at SSCI'2000 [1].
Figure 4.1 presents a pictorial representation of the two Pursuit approaches of converging to an action. The first approach, adopted by the Pursuit Algorithms described in Chapter 3, such as CPRP, CPRI, DPRP and DPRI, always pursues the best-estimated action. The present approach, adopted by the Generalized Pursuit Algorithms which we present here, does not follow only the best action; it follows all the actions that are "better" than the current chosen action, i.e., the actions that have higher reward estimates than the chosen action.
Figure 4.1: Solution approach of the CPRP Pursuit Algorithm and the Generalized Pursuit Algorithm.
In a vectorial form, if action α_m is the action that has the highest reward estimate at time 't', the Pursuit Algorithms always pursue the vector e(t) = [0 0 ... 1 0 ... 0]^T, where e_m(t)=1. In contrast, if α_i denotes the chosen action, the Generalized Pursuit Algorithm pursues the vector e(t), where
e_j(t) = 1, if d̂_j(t) > d̂_i(t), for j ≠ i
e_i(t) = 1, if d̂_i(t) = max_j {d̂_j(t)} (4.1)
e_j(t) = 0, otherwise.
Since this vector e(t) represents the direction towards which the probability vector moves, it is considered the direction vector of the Pursuit Algorithms.
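The direction vector of (Eq. 4.1) is easily computed from the estimate vector; a sketch (the function name is illustrative):

```python
def direction_vector(est, i):
    """Direction vector e(t) of Eq. (4.1): mark every action whose reward
    estimate exceeds that of the chosen action i; the chosen action itself
    is marked only if it currently holds the maximal estimate."""
    best = max(est)
    e = [0] * len(est)
    for j, dj in enumerate(est):
        if j != i and dj > est[i]:
            e[j] = 1
    if est[i] == best:
        e[i] = 1
    return e

print(direction_vector([0.5, 0.8, 0.3, 0.6], 3))  # [0, 1, 0, 0]
print(direction_vector([0.5, 0.8, 0.3, 0.6], 1))  # [0, 1, 0, 0]
```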
This chapter presents two versions of Generalized Pursuit algorithms, followed by a comparative study of the performance of these algorithms against the existing Pursuit algorithms. The first algorithm introduced in this chapter is the Generalized Pursuit Algorithm. This algorithm moves the action probability vector "away" from the actions that have smaller reward estimates, but it does not guarantee that it increases the probability for all the actions with higher estimates than the chosen action.
Next, a discretized Generalized Pursuit Algorithm is presented. This algorithm follows the philosophy of a Generalized Pursuit Algorithm in the sense that it increases the action probability for all the actions with higher reward estimates than the current chosen action.
Due to their generalized philosophy, in environments with two actions these algorithms degenerate to become the existing Pursuit algorithms. Specifically, in two-action environments, the GPA algorithm becomes the CPRP algorithm, and the DGPA algorithm becomes the DPRP algorithm. For this reason, these algorithms were tested only in benchmark ten-action environments, and the simulation results are presented in Section 4.4 of this chapter.
The following sections present the new Generalized Pursuit Algorithms.
4.2 Generalized Pursuit Algorithm
The Generalized Pursuit Algorithm (GPA) presented in this section is an example of an algorithm that generalizes the Pursuit Algorithm introduced by Thathachar and Sastry in [28], referred to as the CPRP algorithm. It is a continuous estimator algorithm, which moves the probability vector towards a set of possible solutions in the probability space. Each possible solution is a unit vector in which the value '1' corresponds to an action that has a higher reward estimate than the chosen action.
The CPRP algorithm increases the probability of the action that has the highest reward estimate and decreases the action probabilities of all the other actions, as shown in the following updating equations. To increase the probability of the best-estimated action and to also preserve P(t) as a probability vector, Thathachar and Sastry's Pursuit algorithm first decreases the probabilities of all actions:
p_j(t+1) = (1 - λ)·p_j(t), for all j = 1, ..., r.
The remaining amount Δ, which brings the sum of the probabilities of all actions to '1', is computed to be:
Δ = 1 - Σ_j (1 - λ)·p_j(t) = λ.
In order to increase the probability of the best-estimated action α_m, the CPRP Pursuit algorithm adds the probability mass Δ to the probability of the best-estimated action:
p_m(t+1) = (1 - λ)·p_m(t) + λ.
In contrast to the CPRP algorithm, the newly introduced GPA algorithm equally distributes the remaining amount Δ to all the actions that have higher estimates than the chosen action. If K(t) denotes the number of actions that have higher estimates than the chosen action α_i at time 't', then the updating equations for the Generalized Pursuit Algorithm are expressed as follows:
p_j(t+1) = (1 - λ)·p_j(t) + λ/K(t), if d̂_j(t) > d̂_i(t), j ≠ i
p_j(t+1) = (1 - λ)·p_j(t), if d̂_j(t) ≤ d̂_i(t), j ≠ i (4.4)
p_i(t+1) = 1 - Σ_{j≠i} p_j(t+1)
In vector form, the updating equations can be expressed as follows:
P(t+1) = (1 - λ)·P(t) + (λ/K(t))·e(t),
where e(t) is the direction vector defined in (Eq. 4.1).
Based on these equations, it can be seen that the GPA algorithm increases the probability of an action with a higher reward estimate than the estimate of the chosen action only if it satisfies the following inequality:
p_j(t) < 1/K(t).
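One GPA update can be sketched directly from these equations. The function is illustrative: `est` holds the current reward estimates and `i` is the index of the chosen action.

```python
def gpa_step(p, est, i, lam):
    """One GPA update (sketch): actions with estimates above the chosen
    action's share the mass lam equally; everything else decays by (1-lam);
    the chosen action i takes up the remaining mass."""
    r = len(p)
    higher = [j for j in range(r) if j != i and est[j] > est[i]]
    K = len(higher)
    q = [0.0] * r
    for j in range(r):
        if j != i:
            q[j] = (1.0 - lam) * p[j] + (lam / K if j in higher else 0.0)
    q[i] = 1.0 - sum(q)                 # keeps P(t+1) a probability vector
    return q

q = gpa_step([0.25, 0.25, 0.25, 0.25], [0.5, 0.8, 0.3, 0.6], i=0, lam=0.1)
print([round(x, 4) for x in q])
```

When the chosen action is itself the best (K(t) = 0), the update degenerates to p_i(t+1) = (1 - λ)·p_i(t) + λ, the classical CPRP move.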
Formally, the Generalized Pursuit Algorithm can be described as follows:
ALGORITHM GPA
Parameters:
λ: the learning parameter, where 0 < λ < 1
m: index of the maximal component of d̂(t), d̂_m(t) = max_{i=1,...,r} {d̂_i(t)}
W_i(t): the number of times the i-th action has been rewarded up to the time t, with 1 ≤ i ≤ r
Z_i(t): the number of times the i-th action has been chosen up to the time t, with 1 ≤ i ≤ r
Method:
Initialization: p_i(t) = 1/r, for 1 ≤ i ≤ r.
Initialize d̂(t) by picking each action a small number of times.
Repeat
Step 1: At time t pick α(t) according to the probability distribution P(t). Let α(t) = α_i.
Step 2: If K(t) represents the number of actions with higher estimates than the chosen action at time t, update P(t) according to the equations (4.4).
Step 3: Update d̂(t) according to the following equations:
W_i(t+1) = W_i(t) + (1 - β(t))
Z_i(t+1) = Z_i(t) + 1
W_j(t+1) = W_j(t), Z_j(t+1) = Z_j(t), for all j ≠ i
End Repeat
END ALGORITHM GPA
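Steps 1 to 3 can be combined into a small runnable simulation. Everything here is a sketch under simple assumptions: a stationary environment is modelled by a list `d` of reward probabilities, the estimates are initialized by sampling each action ten times, and the values of λ and the horizon are illustrative.

```python
import random

def run_gpa(d, lam=0.05, steps=3000, seed=7):
    """Runnable sketch of the GPA loop against a stationary environment
    with reward probabilities d: pick an action from P(t), update the
    reward counts W and usage counts Z, then apply the GPA update."""
    rng = random.Random(seed)
    r = len(d)
    p = [1.0 / r] * r
    W, Z = [0] * r, [0] * r
    for i in range(r):                      # initialize the estimates
        for _ in range(10):
            Z[i] += 1
            W[i] += rng.random() < d[i]
    for _ in range(steps):
        i = rng.choices(range(r), weights=p)[0]            # Step 1
        est = [W[j] / Z[j] for j in range(r)]
        higher = [j for j in range(r) if j != i and est[j] > est[i]]
        K = len(higher)
        for j in range(r):                                 # Step 2
            if j != i:
                p[j] = (1 - lam) * p[j] + (lam / K if j in higher else 0.0)
        p[i] = 1.0 - sum(p[j] for j in range(r) if j != i)
        Z[i] += 1                                          # Step 3
        W[i] += rng.random() < d[i]
    return p

p = run_gpa([0.7, 0.5, 0.3, 0.2])
print(p.index(max(p)), round(max(p), 3))
```

In a typical seeded run the automaton concentrates its probability mass on the action with the highest reward probability.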
As for the previous Pursuit algorithms, the convergence of the GPA is proven in two steps. First, we demonstrate that using a sufficiently small value for the learning parameter λ, all actions are chosen a large enough number of times such that d̂_m(t) will remain the maximum element of the estimate vector d̂(t) after a finite time. Formally, this is expressed as follows:
Theorem 4.1: For any given constants δ > 0 and M < ∞, there exist λ* > 0 and t_0 < ∞ such that under the GPA algorithm, for all λ ∈ (0, λ*),
Pr[all actions are chosen at least M times each before time t] > 1 - δ, for all t ≥ t_0.
Proof: The proof of this theorem is analogous to the proof of the corresponding result for the TSE algorithm. We shall consider the same random variable Y_i^t as the number of times the i-th action was chosen up to time 't' in any specific realization.
From the updating equations (Eq. (4.4)), at any step 't' in the algorithm, if the action α_i is not chosen, we have:
p_i(t) ≥ (1 - λ)·p_i(t-1). (4.9)
The probability of the chosen action p_i(t) can either be (1 - λ)·p_i(t-1), if there are other actions with better estimates than α_i, or it can be (1 - λ)·p_i(t-1) + λ, if the chosen action has the maximal reward estimate. In both these situations, the following inequality is valid:
p_i(t) ≥ (1 - λ)·p_i(t-1). (4.10)
The equations (4.9) and (4.10) show that the following inequality is valid for all the actions:
p_i(t) ≥ (1 - λ)·p_i(t-1), (4.11)
which implies that during any of the first 't' iterations of the algorithm,
Pr(α_i is chosen) ≥ p_i(0)·(1 - λ)^t, for any i = 1, ..., r. (4.12)
With the above clarified, the remainder of the proof is identical to the proof for the TSE algorithm and is omitted (it can be found in [28]). ♦♦♦
The second step in proving the convergence of the GPA consists of demonstrating that if the m-th action is rewarded more than any other action from time t_0 onward, then the action probability vector converges in probability to e_m. This is shown in Theorem 4.2.
Theorem 4.2: Suppose that there exists an index m and a time instant t_0 < ∞ such that d̂_m(t) > d̂_j(t) for all j ≠ m and all t ≥ t_0; then p_m(t) → 1 with probability 1 as t → ∞.
Proof: We shall demonstrate that the sequence of random variables {p_m(t)}_{t≥t_0} is a submartingale. The convergence in probability then results from the submartingale convergence theorem [11].
Consider
Δp_m(t) = E[p_m(t+1) - p_m(t) | Q(t)],
where Q(t) is the state vector for the estimator algorithms, which consists of P(t) and d̂(t).
From the updating equations (4.4) and the assumptions of this theorem, p_m(t+1) can be expressed as:
p_m(t+1) = (1 - λ)·p_m(t) + λ/K(t), if α_j is chosen, j ≠ m
p_m(t+1) = (1 - λ)·p_m(t) + λ, if α_m is chosen.
This implies that for all t ≥ t_0, Δp_m(t) can be calculated to be:
Δp_m(t) = Σ_{j≠m} [λ/K(t) - λ·p_m(t)]·p_j(t) + [λ·(1 - p_m(t))]·p_m(t) ≥ 0.
Hence, p_m(t) is a submartingale. By the submartingale convergence theorem [11], {p_m(t)} converges as t → ∞, and
E[p_m(t+1) - p_m(t) | Q(t)] → 0 with probability 1.
Hence, p_m(t) → 1 with probability 1, and the theorem is proven. ♦♦♦
Finally, the ε-optimal convergence result can be stated as follows:
Theorem 4.3: For the GPA algorithm, in every stationary random environment, there exist a λ* > 0 and t_0 > 0 such that for all λ ∈ (0, λ*) and for any δ ∈ (0, 1) and any ε ∈ (0, 1),
Pr[p_m(t) > 1 - ε] > 1 - δ,
for all t > t_0. ♦♦♦
The proof of this theorem results as a logical consequence of the previous two theorems, Theorem 4.1 and Theorem 4.2.
The simulation results regarding the performance of this algorithm are presented in Section 4.4 of this chapter. The next section presents another version of a Generalized Pursuit algorithm.
4.3 Discretized Generalized Pursuit Algorithm
The Discretized Generalized Pursuit Algorithm, denoted DGPA, is another algorithm that generalizes the concepts of the Pursuit algorithm by "pursuing" all the actions that have higher estimates than the current chosen action. This algorithm moves the probability vector P(t) in discrete steps, but the steps do not have equal sizes.
At each iteration, the algorithm counts how many actions have higher estimates than the current chosen action. If K(t) denotes this number, the DGPA algorithm increases the probability of all the actions with higher estimates by the amount Δ/K(t), and decreases the probabilities of all the other actions by the amount Δ/(r - K(t)), where Δ is a resolution step, Δ = 1/(rN), with N a resolution parameter.
Vectorially, the updating equations can be expressed as follows:
P(t+1) = P(t) + (Δ/K(t))·e(t) - (Δ/(r - K(t)))·(u - e(t)),
where e(t) is the direction vector defined in (Eq. 4.1) and u is the unit vector with u_j = 1, j = 1, 2, ..., r.
A detailed description of the algorithm is given below:
ALGORITHM DGPA
Parameters:
N: resolution parameter
K(t): the number of actions with higher estimates than the current chosen action
Δ: the smallest step size, Δ = 1/(rN)
W_i(t): the number of times the i-th action has been rewarded up to the time t, with 1 ≤ i ≤ r
Z_i(t): the number of times the i-th action has been chosen up to the time t, with 1 ≤ i ≤ r
Method:
Initialization: p_i(t) = 1/r, for 1 ≤ i ≤ r.
Initialize d̂(t) by picking each action a small number of times.
Repeat
Step 1: At time t pick α(t) according to the probability distribution P(t). Let α(t) = α_i.
Step 2: Update P(t) according to the following equations:
p_j(t+1) = min(p_j(t) + Δ/K(t), 1), (∀j) j ≠ i, such that d̂_j(t) > d̂_i(t)
p_j(t+1) = max(p_j(t) - Δ/(r - K(t)), 0), (∀j) j ≠ i, such that d̂_j(t) ≤ d̂_i(t) (4.15)
p_i(t+1) = 1 - Σ_{j≠i} p_j(t+1)
Step 3: Same as in the GPA algorithm.
End Repeat
END ALGORITHM DGPA
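One DGPA probability update can be sketched as follows. The function is illustrative: `est` holds the reward estimates and `i` is the chosen action.

```python
def dgpa_step(p, est, i, N):
    """One DGPA update (sketch): actions estimated better than the chosen
    one gain delta/K(t) each, the rest lose delta/(r-K(t)) (floored at 0),
    and the chosen action absorbs the residual so P(t+1) stays a
    probability distribution."""
    r = len(p)
    delta = 1.0 / (r * N)
    higher = [j for j in range(r) if j != i and est[j] > est[i]]
    K = len(higher)
    q = list(p)
    for j in range(r):
        if j == i:
            continue
        if j in higher:
            q[j] = min(p[j] + delta / K, 1.0)
        else:
            q[j] = max(p[j] - delta / (r - K), 0.0)
    q[i] = 1.0 - sum(q[j] for j in range(r) if j != i)
    return q

q = dgpa_step([0.25, 0.25, 0.25, 0.25], [0.5, 0.8, 0.3, 0.6], i=0, N=2)
print(q)  # [0.1875, 0.3125, 0.1875, 0.3125]
```

Note the moderation property in action: no component drops by more than Δ = 1/(rN) in a single step.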
In contrast to the GPA algorithm, the DGPA always increases the probability of all the actions with higher estimates.
To prove the convergence of this algorithm, we shall first prove that the DGPA possesses the moderation property (see section 2.5.3). Then we prove that this algorithm also possesses the monotone property. As a consequence of these properties, the Markov chain {p_m(t)} is a submartingale [10], and the convergence of the algorithm results from the submartingale convergence theorem [11].
Theorem 4.4: The DGPA possesses the moderation property.
Proof: To prove this property, we shall demonstrate that the value 1/(rN) bounds the magnitude by which any action probability can decrease at any iteration of the algorithm. From the updating equations (4.15), the amount by which a probability can decrease is computed to be:
p_j(t) - p_j(t+1) ≤ Δ/(r - K(t)) ≤ Δ = 1/(rN), since r - K(t) ≥ 1,
and the result is proven. ♦♦♦
The second step in proving the convergence of the DGPA consists of demonstrating that the DGPA possesses the monotone property. This is shown in Theorem 4.5.
Theorem 4.5: The DGPA possesses the monotone property. Suppose that there exists an index m and a time instant t_0 < ∞ such that d̂_m(t) > d̂_j(t), (∀j) j ≠ m, (∀t) t ≥ t_0; then there exists an integer N_0 such that for all N > N_0, p_m(t) → 1 with probability 1 as t → ∞.
Proof: The proof of this theorem aims to show that {p_m(t)}_{t≥t_0} is a submartingale satisfying sup E[|p_m(t)|] < ∞. Then, based on the submartingale convergence theorem [11], {p_m(t)}_{t≥t_0} converges, which implies
E[p_m(t+1) - p_m(t) | Q(t)] → 0 as t → ∞.
Consider
Δp_m(t) = E[p_m(t+1) - p_m(t) | Q(t)],
where Q(t) is the state vector for the estimator algorithms.
From the updating equations (4.15) and the assumptions of this theorem, p_m(t+1) can be expressed as:
p_m(t+1) = p_m(t) + Δ/K(t), if α_j is chosen, j ≠ m
p_m(t+1) = 1 - Σ_{j≠m} p_j(t+1) = p_m(t) + Δ·(r - 1)/r, if α_m is chosen.
This implies that for all t ≥ t_0, Δp_m(t) can be calculated to be:
Δp_m(t) = Σ_{j≠m} (Δ/K(t))·p_j(t) + Δ·((r - 1)/r)·p_m(t) > 0.
Hence, p_m(t) is a submartingale. By the submartingale convergence theorem [11], {p_m(t)}_{t≥t_0} converges as t → ∞, and
E[p_m(t+1) - p_m(t) | Q(t)] → 0 with probability 1.
Hence, p_m(t) → 1 with probability 1, and the theorem is proven. ♦♦♦
Since the DGPA possesses the moderation and monotone properties, it implies that the DGPA is ε-optimal [10].
The next section presents the simulation results for the Generalized Pursuit algorithms introduced in this chapter.
4.4 Simulation Results
The simulation results presented in this section have been obtained under the same conditions as the results presented in section 3.4. First, simulations were performed in order to determine the optimal learning parameter for each learning algorithm. Then, using these learning parameters, each algorithm was required to converge in 750 experiments and the average number of iterations was computed. For each algorithm, 100 tests were performed and the number of iterations presented in this section is the average over these experiments.
In a two-action environment, the GPA algorithm reduces, in a degenerate manner, to the CPRP Pursuit algorithm, and the DGPA reduces to the DPRP algorithm. For this reason, simulations were performed only in ten-action benchmark environments and the results are presented in Table 4.1. For simulation results of the CPRP and DPRP algorithms, see section 3.4.
Table 4.1: Performance of the generalized Pursuit algorithms in benchmark ten-action environments for which exact convergence was required in 750 experiments (NE=750).
Note: The reward probabilities for the actions are:
EA: 0.7 0.5 0.3 0.2 0.4 0.5 0.4 0.3 0.5 0.2
EB: 0.1 0.45 0.84 0.76 0.2 0.4 0.6 0.7 0.5 0.3

            GPA               DGPA            CPRP               DPRI
Environ.    λ       Iterat.   N     Iterat.   λ        Iterat.   N      Iterat.
EA          0.0127  948.03    24    633.64    0.003    2427.3    188    752.7
EB          0.0041  2759.02   52    1307.76   0.00126  5685      1060   2693.7

Since the GPA algorithm was designed as a generalization of the continuous Reward-Penalty Pursuit algorithm, a comparison between these two algorithms is presented. The results show that the GPA algorithm is 60% faster than the CPRP algorithm in the EA environment and 51% faster in the EB environment. For example, in the EA environment, the GPA converges on average in 948.03 iterations, and the CPRP algorithm requires on average 2427 iterations for convergence, which shows an improvement of 60%. In the EB environment, the CPRP algorithm required on average 5685 iterations whereas the GPA algorithm required on average only 2759.02 iterations for convergence, being 51% faster than the CPRP algorithm.
Similarly, the DGPA is classified as a Reward-Penalty discretized Pursuit algorithm. If compared against the DPRP algorithm, the DGPA proves to be up to 59% faster. For example, in the EB environment, the DGPA algorithm converges in an average of 1307.76 iterations whereas the DPRP algorithm requires 3230 iterations. Also,
the DGPA algorithm proves to be the fastest Pursuit algorithm, being up to 50% faster than the DPRI algorithm. For example, in the same environment EB, the DGPA algorithm requires 1307.76 iterations and the DPRI algorithm requires 2693 iterations for convergence.
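These relative-speed figures are again of the form 1 - (faster mean)/(slower mean); using the Table 4.1 entries for the EB environment (rounding accounts for the small differences from the percentages quoted in the text):

```python
def pct_faster(iter_fast, iter_slow):
    """Relative reduction in mean iterations, as a percentage."""
    return 100.0 * (1.0 - iter_fast / iter_slow)

print(round(pct_faster(1307.76, 3230.3), 1))  # DGPA vs DPRP: about 59.5
print(round(pct_faster(1307.76, 2693.7), 1))  # DGPA vs DPRI: about 51.5
```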
The following figure presents a graphical representation of the performance of these algorithms relative to the CPRP algorithm in benchmark ten-action environments.
Figure 4.2: Performance of the Pursuit Algorithms (DGPA, GPA, DPRI, DPRP, CPRI) relative to the CPRP algorithm in ten-action environments for which exact convergence was required in 750 experiments.
Based on these experimental results, and considering the number of iterations required to attain the same accuracy of convergence, we can rank the six Pursuit algorithms as follows:
Best Algorithm: Discretized Generalized Pursuit Algorithm (DGPA)
2nd-best Algorithm: Discretized Pursuit Reward-Inaction (DPRI)
3rd-best Algorithm: Generalized Pursuit Algorithm (GPA)
4th-best Algorithm: Discretized Pursuit Reward-Penalty (DPRP)
5th-best Algorithm: Continuous Pursuit Reward-Inaction (CPRI)
6th-best Algorithm: Continuous Pursuit Reward-Penalty (CPRP)
4.5 Conclusions
This chapter introduced another extension of the class of Pursuit estimator algorithms by presenting the generalization concept of the Pursuit algorithm. The Pursuit algorithms presented in Chapter 3 "pursue" the action that has the maximal reward estimate at any iteration. Through generalization, the generalized Pursuit algorithms "pursue" the actions that have higher estimates than the current chosen action, minimizing the probability of pursuing a wrong action.
Two versions of generalized Pursuit algorithms were presented in this chapter, the Generalized Pursuit Algorithm (GPA) and the Discretized Generalized Pursuit Algorithm (DGPA), along with the proofs of their convergence.
Simulation results were presented to characterize the performance of these algorithms. Based on the experimental results, the generalized Pursuit algorithms prove to be faster than the Pursuit algorithms in environments with more than two actions. Furthermore, in the same environments, the Discretized Generalized Pursuit Algorithm proves to be the fastest converging Pursuit algorithm.
Chapter 5: GENERALIZATION OF THE TSE ALGORITHM
Thathachar and Sastry introduced the concept of estimator learning algorithms by presenting the TSE algorithm [28]. As described in Section 2.5.2.2, the authors of this algorithm presented its updating equations in a scalar form. In this chapter, we first present a vectorial representation of the updating equations of the TSE algorithm. A generalization of the TSE algorithm, derived from its vectorial form, is presented next, along with simulation results that characterize the performance of this algorithm.
5.1 Vectorial representation of the TSE algorithm
The purpose of this section is to present a vectorial representation of the TSE algorithm, which better outlines the underlying concepts of this algorithm.
The TSE algorithm is an estimator algorithm that moves the probability vector P(t) towards the unit vectors associated with all the actions α_j that have higher reward estimates than the chosen action α_i, i.e. for which d_j(t) > d_i(t).

¹ The contributions presented in this chapter are currently being compiled in a paper that will be submitted for publication.
Viewed from a scalar perspective, the updating equations can be specified as in (2.39), where these actions are identified using the indicator function S_ij(t). To identify the same set of actions in vectorial form, a direction vector e(t) can be defined as follows:

e_j(t) = 1, if d_j(t) > d_i(t) or d_j(t) = max_k {d_k(t)}
e_j(t) = 0, otherwise     (5.1)

The actions for which e_k(t) = 1 have the action probability p_k(t+1) increased, whereas the actions for which e_k(t) = 0 have the action probability p_k(t+1) decreased.
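As a small illustration of this construction, the direction vector e(t) defined above can be computed from the current reward estimates in one pass. This is a minimal sketch; the estimate values and the chosen action are hypothetical:

```python
# Sketch: computing the direction vector e(t) for the TSE algorithm.
# d holds hypothetical reward estimates d_j(t); action i is the chosen one.
d = [0.30, 0.55, 0.70, 0.40]
i = 1  # chosen action alpha_i, with estimate d[i] = 0.55

# e_j(t) = 1 if d_j(t) > d_i(t) or d_j(t) is the maximal estimate, else 0.
e = [1 if (d[j] > d[i] or d[j] == max(d)) else 0 for j in range(len(d))]
print(e)  # -> [0, 0, 1, 0]: only action 2 has a higher estimate than action 1
```

Only the probability of action 2 would be increased in this step; all other components of P(t+1) decrease.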
From the updating equations (2.39), it can be seen that the TSE algorithm increases an action's probability proportionally to the distance between its estimate and the estimate of the chosen action, d_j(t) − d_i(t). To represent the same concept in vectorial form, we define a distance matrix F(t) as follows:
F(t) = diag( f(|d_1(t) − d_i(t)|), ..., f(|d_r(t) − d_i(t)|) )     (5.2)

where f : [−1, 1] → [−1, 1] is a monotone, increasing function satisfying f(0) = 0. It is easy to note that computing the F(t) matrix does not increase the computational complexity of the TSE algorithm. To define the matrix F(t), one has only to compute the values f(|d_i(t) − d_j(t)|), for any j = 1, 2, ..., r, as in the case of a scalar-form representation. Furthermore, as we shall see, the actual implementation does not necessarily utilize the vector conceptualization.
To preserve P(t) as a probability vector, we introduce a weight matrix V(t), which is diagonal:

v_kj(t) = 0, if k ≠ j;  v_jj(t) = p_i(t)/(r−1), for j ≠ i     (5.3)

Using the above-defined matrices, the updating equations for the TSE algorithm can be written as:

P(t+1) = P(t) + λ·F(t)·V(t)·(e(t) − P(t))     (5.4)

or, equivalently:
P(t+1) = (I − λ·F(t)·V(t))·P(t) + λ·F(t)·V(t)·e(t)     (5.5)

where I is the identity matrix of dimension r.
The vectorial representation of the TSE algorithm⁶, Eq. (5.4), shows that, conceptually, the TSE algorithm moves the action probability vector P(t) towards the actions with higher estimates than the estimate of the currently chosen action. This movement is done proportionally to the distance between the current action probability vector and the potential solution e(t), and it involves a learning parameter λ, a factor of the distance between the estimates, the matrix F(t), and a weight matrix V(t).
In this form, the updating equations of the TSE algorithm can easily be compared with the following updating equation of the Continuous Reward-Penalty Pursuit Algorithm:

P(t+1) = P(t) + λ·(e_m(t) − P(t))     (5.6)
It is easy to see that the vectorial representation of the TSE algorithm leads to various physical interpretations of the convergence process of the TSE algorithm. First of all, the TSE algorithm "extends" the Pursuit algorithms by pursuing more than one action, i.e. it pursues all the actions with higher estimates. Furthermore, equation (5.4) shows that the TSE algorithm considers three more factors in its updating equations: a function of the distance between the estimates, the matrix F(t), and a weight matrix V(t).
⁶ Although we can represent the updating equations of the TSE algorithm in vector form, the implementation of the algorithm can be done in a scalar fashion, as will be explained presently.
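To make the vectorial update concrete, one iteration of Eq. (5.4) can be sketched with NumPy. For illustration we assume f(x) = x, apply the same weight p_i(t)/(r−1) to every component, and let the chosen component be set by normalization; the numeric values are hypothetical.

```python
import numpy as np

r = 4
P = np.array([0.25, 0.25, 0.25, 0.25])   # action probability vector P(t)
d = np.array([0.30, 0.55, 0.70, 0.40])   # hypothetical reward estimates d(t)
lam = 0.1                                 # learning parameter lambda
i = 1                                     # chosen action alpha_i

# Direction vector e(t), distance matrix F(t) with f(x) = x, and diagonal
# weight matrix V(t) with the TSE weight p_i/(r-1) (an assumption here).
e = ((d > d[i]) | (d == d.max())).astype(float)
F = np.diag(np.abs(d - d[i]))
V = np.diag(np.full(r, P[i] / (r - 1)))

P_new = P + lam * F @ V @ (e - P)          # Eq. (5.4)
P_new[i] = 1.0 - (P_new.sum() - P_new[i])  # chosen component keeps sum(P) = 1

# The higher-estimate action 2 gains probability; P(t+1) is still a distribution.
print(P_new)
```

Note that, componentwise, (e(t) − P(t))_j equals 1 − p_j(t) for the pursued actions and −p_j(t) for the others, which is exactly the increase/decrease pattern described above.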
This vectorial representation of the TSE algorithm sets the basis for permitting various generalizations. The following section presents one such generalization of the TSE algorithm.
5.2 Generalization of the TSE algorithm
This section presents a class of algorithms that generalize the TSE algorithm by permitting variations on the weight matrix V(t), described above. The purpose of the weight matrix V(t) is to ensure that the action probability vector P(t) is preserved as a probability vector. When all the reward estimates are higher than the reward estimate of the chosen action α_i, the automaton increases all the probabilities p_j(t+1), j ≠ i. To ensure that the amount by which these probabilities increase does not surpass the value of p_i(t), Thathachar and Sastry equally distributed the value of p_i(t) to all r−1 actions with higher estimates. Hence, the increment of each probability p_j(t) is multiplied by p_i(t)/(r−1).
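As a quick sanity check of this normalization, the total amount handed out to the other actions under the weight p_i(t)/(r−1) can never exceed λ·p_i(t), so p_i(t+1) cannot become negative. A minimal sketch, assuming f(x) = x and hypothetical distance values:

```python
# Sketch: with the TSE weight p_i/(r-1), the total increment distributed to
# the other r-1 actions in one step is bounded by lam * p_i.
lam = 0.5
p_i = 0.3
r = 5
f_vals = [0.9, 0.6, 0.2, 0.7]   # hypothetical values f(|d_j - d_i|) <= 1

increments = [lam * f * (p_i / (r - 1)) for f in f_vals]
assert sum(increments) <= lam * p_i   # the chosen action keeps at least (1-lam)*p_i
print(sum(increments), lam * p_i)
```

The bound follows because each of the r−1 terms is at most λ·p_i/(r−1), since f(x) ≤ 1.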
In this section, we present a generalized version of the TSE algorithm, denoted by GTSE. In GTSE, each increasing probability p_j(t), j = 1, ..., K(t), is scaled with a factor p_i(t)·γ_j(t), where the γ_j(t), j = 1, ..., K(t), are defined such that

γ_j(t) ≥ 0 and γ_1(t) + γ_2(t) + ... + γ_K(t)(t) ≤ 1     (5.7)

and K(t) represents the number of actions with higher reward estimates than the chosen action at time t.
Considering this, the weight matrix V(t) is defined as:

v_kj(t) = 0, if k ≠ j;  v_jj(t) = p_i(t)·γ_j(t), for the actions α_j with d_j(t) > d_i(t)     (5.8)
The TSE algorithm then becomes a particular case of the GTSE algorithm, when γ_j(t) takes the following value:

γ_j(t) = 1/(r−1), for all j = 1, ..., K(t).

Note that in this case the sum of all the factors is less than or equal to 1:

γ_1(t) + γ_2(t) + ... + γ_K(t)(t) = K(t)/(r−1) ≤ 1
Moreover, the factor γ_j(t) could be extended such that it represents the a priori information about the action probabilities [14]; but, at this time, this problem remains open.
This single generalization leads us to a new learning strategy, described in detail below.
ALGORITHM GTSE
Parameters:
λ       the speed of learning parameter, where 0 < λ < 1
m       index of the maximal component of d(t), d_m(t) = max_{i=1,...,r} {d_i(t)}
W_i(t)  the number of times the i-th action has been rewarded up to time t, with 1 ≤ i ≤ r
Z_i(t)  the number of times the i-th action has been chosen up to time t, with 1 ≤ i ≤ r
f : [−1, 1] → [−1, 1]  a monotone, increasing function satisfying f(0) = 0
Method:
Initialization: p_i(t) = 1/r, for 1 ≤ i ≤ r
Initialize d(t) by picking each action a small number of times.
Repeat
  Step 1: At time t, pick α(t) according to the probability distribution P(t). Let α(t) = α_i.
  Step 2: Update P(t) according to the following equations:
    p_j(t+1) = p_j(t) + λ·f(|d_j(t) − d_i(t)|)·γ_j(t)·p_i(t)·(1 − p_j(t)), for all j ≠ i with d_j(t) > d_i(t)
    p_j(t+1) = p_j(t) − λ·f(|d_j(t) − d_i(t)|)·p_j(t), for all j ≠ i with d_j(t) ≤ d_i(t)
    p_i(t+1) = 1 − Σ_{j≠i} p_j(t+1)     (5.10)
  Step 3: Update d(t) according to the following:
    W_i(t+1) = W_i(t) + (1 − β(t))
    Z_i(t+1) = Z_i(t) + 1
    d_i(t+1) = W_i(t+1) / Z_i(t+1)
End Repeat
END ALGORITHM GTSE
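A minimal scalar implementation of this loop can be sketched as follows, with the choice γ_j(t) = 1/K(t) (the version simulated in Section 5.3), f(x) = x, and a Bernoulli environment. The scalar update used here (increase the higher-estimate actions, shrink the rest, and renormalize through the chosen action) is an illustrative reconstruction, not necessarily the exact equations (5.10); the environment, parameter values and function name are assumptions.

```python
import random

def gtse(reward_probs, lam=0.05, steps=8000, seed=1):
    """Sketch of the GTSE loop with gamma_j(t) = 1/K(t) and f(x) = x."""
    rng = random.Random(seed)
    r = len(reward_probs)
    P = [1.0 / r] * r            # action probability vector P(t)
    W = [0] * r                  # reward counts W_i(t)
    Z = [0] * r                  # selection counts Z_i(t)
    for i in range(r):           # initialize estimates: sample each action
        for _ in range(10):
            Z[i] += 1
            W[i] += rng.random() < reward_probs[i]
    for _ in range(steps):
        i = rng.choices(range(r), weights=P)[0]        # Step 1
        d = [W[k] / Z[k] for k in range(r)]            # estimates d(t)
        higher = [j for j in range(r) if d[j] > d[i]]
        gamma = 1.0 / len(higher) if higher else 0.0   # gamma_j(t) = 1/K(t)
        for j in range(r):                             # Step 2
            if j == i:
                continue
            if d[j] > d[i]:
                P[j] += lam * abs(d[j] - d[i]) * gamma * P[i] * (1 - P[j])
            else:
                P[j] -= lam * abs(d[j] - d[i]) * P[j]
        P[i] = 1.0 - sum(P[j] for j in range(r) if j != i)
        Z[i] += 1                                      # Step 3
        W[i] += rng.random() < reward_probs[i]
    return P

# Ten-action benchmark E_A reward probabilities (from the note to Table 5.1).
P = gtse([0.7, 0.5, 0.3, 0.2, 0.4, 0.5, 0.4, 0.3, 0.5, 0.2])
print(max(range(len(P)), key=lambda k: P[k]))  # index of the converged action
```

With these settings the probability vector typically concentrates on the optimal action (index 0, reward probability 0.7), while remaining a valid distribution at every step.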
As in the case of the TSE algorithm, this generalized version can be proven to be ε-optimal in any stationary environment. The proof of the ε-optimality of the GTSE algorithm follows the same idea as that of the TSE algorithm. First, it can be shown that, using a sufficiently small value for the learning parameter λ, all actions are chosen a large enough number of times so that d_m(t) will remain the maximum element of the estimate vector d(t) after a finite time. Formally, this is stated and proved below.
Theorem 5.1: For any given constants δ > 0 and M < ∞, there exist λ* > 0 and t₀ < ∞, such that under the GTSE algorithm, for all λ ∈ (0, λ*),
Pr[all actions are chosen at least M times each before time t] ≥ 1 − δ, for all t ≥ t₀.
Proof:
Let us define the random variable Y_i^t as the number of times the i-th action was chosen up to time t in any specific realization. We must prove that Pr[Y_i^t > M] > 1 − δ, which is equivalent to

Pr[Y_i^t ≤ M] < δ.     (5.11)

The events {Y_i^t = k}, {Y_i^t = j} are mutually exclusive for j ≠ k, and so (5.11) simplifies to:

Σ_{k=0}^{M} Pr[Y_i^t = k] < δ.     (5.12)
We shall now prove that if the action α_i is chosen, then the action probability p_i(t) satisfies the following inequality:

p_i(t) ≥ (1 − λ)·p_i(t−1)     (5.13)

From the updating equations (Eq. (5.10)), at any step t of the algorithm, if the action α_i is chosen, the action probability p_i(t) becomes:

p_i(t) = p_i(t−1) + λ·[ Σ_{j≠i: d_j ≤ d_i} f(|d_i(t−1) − d_j(t−1)|)·p_j(t−1) − Σ_{j: d_j > d_i} f(|d_j(t−1) − d_i(t−1)|)·γ_j(t−1)·p_i(t−1)·(1 − p_j(t−1)) ]
The action α_i is chosen based on its action probability at time t−1, and it can therefore have any reward estimate value d_i(t−1). Specifically, the chosen action α_i could either be the action with the maximum reward estimate, or the action with the minimum reward estimate, or one of the actions with an intermediate value of its reward estimate, as expressed by the following cases:

Case 1: d_i(t−1) = max_j {d_j(t−1)}
Case 2: d_i(t−1) = min_j {d_j(t−1)}
Case 3: min_j {d_j(t−1)} < d_i(t−1) < max_j {d_j(t−1)}

Therefore, the action probability p_i(t) can have a range of values, depending on which action is chosen.
We proceed to prove the inequality (5.13). If any action α(t) = α_i, i = 1, ..., r, is chosen, it is sufficient to prove that the minimum possible value of the action probability of the chosen action, p_i(t), satisfies this inequality. Based on the updating equations (5.10), it is easy to see that p_i(t) achieves its minimum value when the chosen action is the action with the minimum reward estimate, as described in Case 2. In the other two cases, Case 1 and Case 3, there are some actions with smaller reward estimates than the chosen action, and therefore the action probabilities of these actions decrease. The amount by which these probabilities decrease is added to the action probability p_i(t); hence, in these cases, p_i(t) does not achieve the minimum value. In Case 2, the action probabilities increase for all the actions α_j, j ≠ i, and the action probability p_i(t) attains its minimum possible value. This value is calculated to be:

p_i(t) = p_i(t−1)·( 1 − λ·Σ_{j≠i} f(|d_j(t−1) − d_i(t−1)|)·γ_j(t−1)·(1 − p_j(t−1)) )

Because f(x) ≤ 1 and 1 − p_j(t−1) ≤ 1, this implies:

p_i(t) ≥ p_i(t−1)·( 1 − λ·Σ_{j≠i} γ_j(t−1) )

From the definition of the γ_j(t) factors (Eq. (5.7)) we know that the sum of these factors is at most unity, which implies that

p_i(t) ≥ (1 − λ)·p_i(t−1)

This proves that p_i(t) ≥ (1 − λ)^t·p_i(0), which implies that during any of the first t iterations of the algorithm,

Pr{α_i is chosen} ≥ p_i(0)·(1 − λ)^t, for any i = 1, ..., r.
Also, at any iteration of the algorithm, Pr{α_i is chosen} ≤ 1.
The remainder of the proof is identical to the proof for the TSE algorithm and is omitted; it can be found in [28]. ♦ ♦ ♦
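The key step of the proof, inequality (5.13), is easy to verify numerically for the worst case (Case 2): however the estimates fall, the chosen action's probability shrinks by at most a factor (1 − λ) in a single step. A minimal sketch, assuming for illustration f(x) = x, γ_j(t) = 1/K(t), and that each increasing probability gains λ·f(|d_j − d_i|)·γ_j(t)·p_i(t)·(1 − p_j(t)); the numbers are hypothetical.

```python
# Sketch: one step in Case 2 (the chosen action has the minimum estimate),
# checking inequality (5.13): p_i(t) >= (1 - lam) * p_i(t-1).
lam = 0.2
P = [0.4, 0.3, 0.2, 0.1]        # P(t-1)
d = [0.1, 0.9, 0.8, 0.7]        # chosen action 0 has the minimum estimate
i = 0
higher = [j for j in range(len(P)) if d[j] > d[i]]
gamma = 1.0 / len(higher)       # gamma_j(t) = 1/K(t): the factors sum to 1

# Every other action increases, so p_i attains its minimum possible value.
others = [P[j] + lam * abs(d[j] - d[i]) * gamma * P[i] * (1 - P[j])
          for j in higher]
p_i_new = 1.0 - sum(others)

assert p_i_new >= (1 - lam) * P[i]   # inequality (5.13) holds
print(p_i_new, (1 - lam) * P[i])
```

Since f(x) ≤ 1, 1 − p_j ≤ 1 and the γ_j sum to at most 1, the total loss is bounded by λ·p_i, exactly as in the derivation above.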
The next step in proving the ε-optimality of the GTSE algorithm consists of demonstrating that if there is an action α_m for which the reward estimate remains maximal after a finite number of iterations, then the m-th component of the action probability vector converges in probability to 1. This result is formally proven in the next theorem.
Theorem 5.2: Suppose that there exists an index m and a time instant t₀ < ∞ such that d_m(t) > d_j(t), for all j ≠ m and all t ≥ t₀. Then p_m(t) → 1 with probability 1 as t → ∞.
Proof:
To prove this result, we shall demonstrate that the sequence of random variables {p_m(t)}_{t ≥ t₀} is a submartingale. The convergence in probability will then result from the submartingale convergence theorem [11].
Based on the assumptions of the theorem, p_m(t+1) can take the following values: for t ≥ t₀, whenever any action α_i with i ≠ m is chosen, p_m(t+1) increases (since d_m(t) > d_i(t)), and when α_m itself is chosen, p_m(t+1) is set by normalization. From these equations we calculate the conditional increment

Δp_m(t) = E[p_m(t+1) − p_m(t) | Q(t)]     (5.18)

Since Δp_m(t) ≥ 0 for all t ≥ t₀, it follows that {p_m(t)} is a submartingale. By the submartingale convergence theorem [11], {p_m(t)} converges as t → ∞, so that

E[p_m(t+1) − p_m(t) | Q(t)] → 0 with probability 1.

Since this increment vanishes only when p_m(t) = 1, we have p_m(t) → 1 with probability 1, and the theorem is proven. ♦ ♦ ♦
Finally, Theorem 5.3 states the ε-optimality convergence of the GTSE algorithm; it can easily be derived from the two previous results.
Theorem 5.3: For the GTSE algorithm, in every stationary random environment, there exist a λ* > 0 and a t₀ < ∞, such that for all λ ∈ (0, λ*) and for any δ ∈ (0, 1) and any ε ∈ (0, 1),
Pr[p_m(t) > 1 − ε] > 1 − δ
for all t ≥ t₀.
This proves that this generalized TSE algorithm is ε-optimal in every stationary environment.
The next section presents simulation results for the GTSE algorithm.
5.3 Simulation Results
This section presents simulation results performed to compare the performance of the TSE algorithm with a specific version of the Generalized TSE algorithm. In simulating the GTSE algorithm, γ_j(t) has been assigned the value 1/K(t), where K(t) is the number of actions with higher estimates than the chosen action α_i(t) at time t, or 1 if the chosen action is the one with the highest reward estimate.
The simulation environments that have been considered are the same benchmark environments as the ones for which the simulations were described in Sections 3.4 and 4.4. Specifically, the first set of simulations was performed in order to determine the optimal value of the learning parameter for each of these algorithms. Then, using these learning parameters, each algorithm was required to converge in 750 experiments and the average number of iterations was computed. For each algorithm, 100 tests were performed, and the number of iterations presented in this section is the average over these experiments.
For the two-action environments, the weight matrix V(t) for the TSE algorithm is identical to the weight matrix for the GTSE algorithm. Therefore, simulations were performed only for the benchmark ten-action environments EA and EB. Table 5.1 presents the results obtained in these environments.
Table 5.1: Performance of the GTSE and TSE algorithms in ten-action benchmark environments for which exact convergence was required in 750 experiments (NE=750)
[Table residue: the columns give, for each environment, the learning parameter λ and the average number of iterations (No. of Iterat.) for the GTSE and TSE algorithms; the legible values are quoted in the discussion that follows.]
Note: The reward probabilities for the actions are:
EA: 0.7 0.5 0.3 0.2 0.4 0.5 0.4 0.3 0.5 0.2
EB: 0.1 0.45 0.84 0.76 0.2 0.4 0.6 0.7 0.5 0.3
The simulation results show that, considering γ_j(t) = 1/K(t), where K(t) is the number of actions with higher estimates than the chosen action at time t, the generalized TSE algorithm exhibits an increase in performance of up to 9% relative to the performance of
the TSE algorithm. For example, in the benchmark environment EB, the TSE algorithm required on average 2745.64 iterations for convergence, whereas the GTSE algorithm required on average 2485.69 iterations, being 9% faster than the TSE algorithm. Also, when compared against the class of Pursuit algorithms, the GTSE algorithm proves to be faster than the CPRP and CPRI algorithms, and faster than the Generalized Pursuit Algorithm GPA. For example, in the EA environment, the GTSE algorithm requires on average 860.39 iterations, whereas the CPRI requires on average 1230.3 iterations, the CPRP requires on average 2427.3 iterations for convergence, and the GPA requires 948.03.
The following chart shows the improvement in performance of the existing continuous estimator algorithms relative to the performance of the CPRP algorithm.
Figure 5.1: Performance of the continuous Estimator Algorithms relative to the CPRP algorithm in ten-action environments for which exact convergence was required in 750 experiments
It can easily be seen that, based on these experimental results, the GTSE algorithm proves to be the fastest continuous estimator algorithm in benchmark ten-action environments.
Although the GTSE is a continuous algorithm, simulations were executed to compare its performance against the discretized TSE algorithm (DTSE). To obtain the simulation results for the DTSE algorithm, we first had to determine the set of learning parameters that led to the best performance of the algorithm. Unlike the other estimator algorithms, the DTSE algorithm has two learning parameters: N, the resolution parameter, and θ. Therefore, we had to determine both the N and θ for which the DTSE algorithm exhibits the best performance when required to converge in NE experiments. Rather than searching the entire (N, θ) parameter space, we have considered just a few values of θ (θ=2, θ=10, θ=100), and determined the optimal value of N for these values of θ. Although the results do not span the entire parameter space, it is clear that the results are both typical and conclusive. The best performance of the DTSE algorithm was obtained with θ=2, and the simulation results are presented in Table 5.2.
Table 5.2: Performance of the GTSE and DTSE algorithms in benchmark ten-action environments for which exact convergence was required in 750 experiments (NE=750)

Environment | GTSE: λ | GTSE: No. of Iterat. | DTSE: N | DTSE: No. of Iterat.
EA          | 0.068   | 860.39               | 99      | 518.18
EB          | 0.038   | 2485.69              | 248     | 1126.86

Note: DTSE results were obtained with θ=2.

From these results, one can notice that the DTSE algorithm is up to 50% faster than the GTSE algorithm. For example, in the EB environment, the GTSE algorithm required on average 2485.69 iterations to converge, whereas the DTSE algorithm required on average 1126.86 iterations.
The problem of extending the GTSE algorithm into the discretized world remains open.
5.4 Conclusions
Thathachar and Sastry introduced the TSE algorithm [29] and presented it in a scalar form. In this chapter, we introduced a vectorial representation of the TSE algorithm, which led to a better conceptual perception of the TSE algorithm. This new vectorial form of the TSE algorithm also provided the background for generalizing the TSE algorithm by extending the weight matrix V(t).
A class of generalized TSE algorithms, GTSE, has been presented in this chapter, along with the proof of their ε-optimality. Simulation results have been presented for a specific GTSE algorithm which considered the factors γ_j(t) = 1/K(t), where K(t) is the number of actions with higher estimates than the chosen action at time t. Although the GTSE algorithm can be represented in vector form, its implementation has been done in scalar form, as presented in the detailed description of the algorithm.
Based on the experimental results, we submit that the new GTSE algorithm is superior to the TSE algorithm. In the ten-action benchmark environments, it exhibits an increased performance of up to 9% when compared against the well-known TSE algorithm. Furthermore, when compared against all the existing continuous estimator algorithms, the GTSE algorithm proves to be the fastest one.
The generalization of the TSE algorithm has been presented only in a continuous probability space. Its version in a discretized probability space remains an open problem.
We conclude this section by mentioning that the novel results of this chapter are currently being compiled into a potential publication.
Chapter 6: CONCLUSIONS
6.1 Summary
In this thesis, we have presented a study of the reported and of some novel Estimator Algorithms in Learning Automata. The thesis started by presenting an overview of the field of learning automata and of some of the most important existing algorithms. Tsetlin [31] first introduced the concept of a learning automaton by designing a Fixed Structure Stochastic Automaton (FSSA), traditionally known as the Tsetlin Automaton. Chapter 2 presented this automaton along with other FSSA automata. We then introduced the family of Variable Structure Stochastic Automata (VSSA), and we explained various such learning automata (e.g. LRI, LRP, LIP). The discretization concept was then highlighted, and the discretized versions of the continuous VSSA algorithms were presented. Finally, Chapter 2 also explained the Estimator Algorithms, by describing some of the existing continuous and discrete Estimator Algorithms such as the CPRP, TSE, DPRI and DTSE schemes.
The thesis continued by concentrating on the study of the Pursuit Estimator Algorithms. In Chapter 3, we presented new versions of the Pursuit algorithms that
resulted from the combination of the Reward-Penalty and Reward-Inaction learning paradigms with the continuous and discrete models of computation. Just as in the case of Pursuit automata, these new algorithms were shown to be ε-optimal in all random environments. Also, in this chapter we presented a performance comparison of the newly introduced algorithms, the CPRI and DPRP, with the already existing versions of Pursuit algorithms, the CPRP and DPRI. Using these simulation results, we ranked these Pursuit algorithms based on their performance as follows:
Best Algorithm: Discretized Pursuit Reward-Inaction (DPRI)
2nd-best Algorithm: Discretized Pursuit Reward-Penalty (DPRP)
3rd-best Algorithm: Continuous Pursuit Reward-Inaction (CPRI)
4th-best Algorithm: Continuous Pursuit Reward-Penalty (CPRP)
In Chapter 4, we continued the study of the Pursuit estimator algorithms by considering two generalized versions of the CPRP Pursuit algorithm. These new algorithms generalize the Pursuit concept by attempting to linearly 'pursue' not only the action with the maximal estimate, but also all the actions with higher reward estimates. The new automaton, referred to as the GPA automaton, is an example of a continuous generalized Pursuit algorithm. Using these same concepts, we then presented a discretized version, the DGPA, of a generalized Pursuit algorithm. Regarding their convergence properties, we proved that these two algorithms are ε-optimal. As part of the evaluation of these new algorithms, simulations were performed to determine their performance. Based on the experimental results obtained, we concluded that the GPA algorithm is the fastest-converging continuous Pursuit Algorithm, and that the
DGPA is the fastest discrete Pursuit algorithm and, more generally, the fastest Pursuit algorithm. From the performance perspective, we ranked the Pursuit algorithms as follows:
Best Algorithm: Discretized Generalized Pursuit Algorithm (DGPA)
2nd-best Algorithm: Discretized Pursuit Reward-Inaction (DPRI)
3rd-best Algorithm: Generalized Pursuit Algorithm (GPA)
4th-best Algorithm: Discretized Pursuit Reward-Penalty (DPRP)
5th-best Algorithm: Continuous Pursuit Reward-Inaction (CPRI)
6th-best Algorithm: Continuous Pursuit Reward-Penalty (CPRP)
In Chapter 5, we concentrated on the study of the TSE estimator algorithm, introduced by Thathachar and Sastry. We first presented a novel vectorial representation of the updating equations of this algorithm. This vectorial form (Eq. (5.4)) outlines the concepts used by the TSE algorithm in the learning process, and it allows for an easy comparison of the TSE algorithm with the CPRP algorithm. Furthermore, this vectorial representation set the basis for permitting various generalizations of the TSE algorithm. In this chapter we formulated a generalized TSE algorithm, referred to as GTSE, based on variations of the weight matrix V(t) which appears in the vectorial representation. The GTSE algorithm has been presented only in the continuous probability space, its discretization remaining an open problem. Regarding its convergence, we proved that this algorithm is ε-optimal in every stationary environment. To complete the characterization of the newly introduced GTSE algorithm, we presented simulation results, which show that the GTSE algorithm is the fastest reported continuous estimator algorithm.
In conclusion, in this thesis we have introduced five new estimator algorithms: CPRI, DPRP, GPA, DGPA and GTSE. The following graph summarizes the performance improvement of all these estimator algorithms relative to the performance of the CPRP algorithm.
Figure 6.1: Performance of some Estimator Algorithms relative to the CPRP algorithm in ten-action environments for which exact convergence was required in 750 experiments
Based on the experimental results presented in this thesis, we conclude that, among all the algorithms that we have analyzed, the newly introduced GPA algorithm is the fastest continuous Pursuit algorithm, and the DGPA is the fastest converging discretized Pursuit estimator algorithm and, indeed, the fastest Pursuit estimator algorithm. In the class of continuous estimator algorithms, the GTSE proved to be the fastest continuous estimator algorithm.
6.2 Future work
In this thesis, the class of Estimator Algorithms has been extensively studied and various new schemes have been introduced. In spite of this, this study proved that these algorithms can be expanded further; moreover, it opens some new research directions.
One interesting direction that should be followed is to try to incorporate a priori information in the weight matrix V(t) of the generalized TSE (GTSE) algorithm. Additionally, since a generalization of the TSE algorithm has been presented only in the continuous probability space, future research could concentrate on generalizing the TSE algorithm also in the discrete space.
REFERENCES
[1] M. Agache, B.J. Oommen, "Continuous and Discretized Generalized Pursuit Learning Schemes", 4th World Multiconference on Systemics, Cybernetics and Informatics, July 23-26, Orlando, Florida.
[2] K.S. Fu, "Learning Control Systems - Review and Outlook", IEEE Trans. on Automatic Control, Vol. 15, pp. 210-221, April 1970.
[3] K.S. Fu, "Learning Control Systems and Intelligent Control Systems: An Intersection of Artificial Intelligence and Automatic Control", IEEE Trans. on Automatic Control, Vol. 16, pp. 70-72, 1971.
[4] K.S. Fu, "Pattern Recognition and Machine Learning", Plenum Press, 1971.
[5] K. Fukunaga, "Introduction to Statistical Pattern Recognition", Academic Press, 1972.
[6] D.L. Isaacson, R.W. Madsen, "Markov Chains: Theory and Applications", New York: John Wiley & Sons, 1976.
[7] S. Lakshmivarahan, "Learning Algorithms Theory and Applications", New York: Springer-Verlag, 1981.
[8] S. Lakshmivarahan, M.A.L. Thathachar, "Absolutely Expedient Learning Algorithms for Stochastic Automata", IEEE Trans. Syst., Man and Cybern., SMC-3, 1973, pp. 281-286.
[9] J.K. Lanctôt, "Discrete estimator algorithms: A mathematical model of computer learning", M.Sc. Thesis, Dept. Math. Statistics, Carleton Univ., Ottawa, Canada, 1989.
[10] J.K. Lanctôt, B.J. Oommen, "Discretized Estimator Learning Automata", IEEE Trans. on Syst. Man and Cybernetics, Vol. 22, No. 6, pp. 1473-1483, November/December 1992.
[11] K.S. Narendra, M.A.L. Thathachar, "Learning Automata: An Introduction", Prentice-Hall, 1989.
[12] K.S. Narendra, M.A.L. Thathachar, "Learning Automata - A Survey", IEEE Trans. on Syst. Man and Cybernetics, Vol. SMC-4, 1974, pp. 323-334.
[13] K.S. Narendra, E.A. Wright, L.G. Mason, "Application of Learning Automata to Telephone Traffic Routing and Control", IEEE Trans. on Syst., Man, and Cybern., Vol. SMC-7, No. 11, November 1977.
[14] B.J. Oommen, "Absorbing and Ergodic Discretized Two-Action Learning Automata", IEEE Trans. Syst. Man Cybern., vol. SMC-16, no. 4, pp. 282-296, Mar./Apr. 1986.
[15] B.J. Oommen, "A Learning Automaton Solution to the Stochastic Minimum-Spanning Circle Problem", IEEE Trans. Syst. Man Cybern., vol. SMC-16, no. 2, July/August 1986.
[16] B.J. Oommen and M. Agache, "A Comparison of Continuous and Discretized Pursuit Learning Schemes", IEEE International Conference on Syst. Man. Cybern., October 12-19, 1999, Tokyo, Japan.
[17] B.J. Oommen and J.P. Christensen, "Epsilon-optimal discretized reward-penalty learning automata", IEEE Trans. Syst. Man. Cybern., vol. SMC-18, pp. 451-458, May/June 1988.
[18] B.J. Oommen and J.K. Lanctôt, "Discretized Pursuit Learning Automata", IEEE Trans. Syst. Man. Cybern., vol. 20, No. 4, pp. 931-938, July/August 1990.
[19] B.J. Oommen and E.R. Hansen, "The asymptotic optimality of discretized linear reward-inaction learning automata", IEEE Trans. Syst. Man. Cybern., pp. 542-545, May/June 1984.
[20] B.J. Oommen and G. Raghunath, "Automata Learning and Intelligent Tertiary Searching for Stochastic Point Location", IEEE Trans. Syst. Man. Cybern. - Part B: Cybernetics, Vol. 28, No. 6, pp. 947-954, December 1998.
[21] G.I. Papadimitriou, "A New Approach to the Design of Reinforcement Schemes for Learning Automata: Stochastic Estimator Learning Algorithms", IEEE Trans. on Knowledge and Data Eng., Vol. 6, No. 4, pp. 649-654, August 1994.
[22] A. Paz, "Introduction to Probabilistic Automata", New York: Academic Press, 1971.
[23] J. Slagle, "Artificial Intelligence and Heuristic Programming", McGraw-Hill, 1975.
[24] R.S. Sutton, A.G. Barto, "Reinforcement Learning - An Introduction", The MIT Press, Cambridge, Massachusetts, 1998.
[25] R. Solomonoff, "Some Recent Work in Artificial Intelligence", Proc. of IEEE, Vol. 54, Dec. 1966.
[26] M.A.L. Thathachar and B.J. Oommen, "Discretized reward-inaction learning automata", J. Cybern. Information Sci., pp. 24-29, Spring 1979.
[27] M.A.L. Thathachar and P.S. Sastry, "Pursuit Algorithm for Learning Automata", transcript provided by the author.
[28] M.A.L. Thathachar and P.S. Sastry, "A class of rapidly converging algorithms for learning automata", presented at IEEE Int. Conf. on Cybernetics and Society, Bombay, India, Jan. 1984.
[29] M.A.L. Thathachar and P.S. Sastry, "A New Approach to the Design of Reinforcement Schemes for Learning Automata", IEEE Trans. Syst., Man. Cybern., vol. SMC-15, No. 1, pp. 168-175, January/February 1985.
[30] M.A.L. Thathachar and P.S. Sastry, "Learning Optimal Discriminant Functions Through a Cooperative Game of Automata", IEEE Trans. on Systems, Man, and Cybernetics, Vol. SMC-17, No. 1, January/February 1987.
[31] M.L. Tsetlin, "On the behavior of finite automata in random media", Automat. Telemek. (USSR), vol. 22, pp. 1345-1354, Oct. 1961.
[32] M.L. Tsetlin, "Automaton Theory and the Modeling of Biological Systems", New York: Academic, 1973.
[33] R. Viswanathan and K.S. Narendra, "Stochastic Automata Models with Application to Learning Systems", IEEE Trans. Syst., Man and Cyber., SMC-3, 1973, pp. 107-111.
[34] V.I. Varshavskii and I.P. Vorontsova, "On the behavior of stochastic automata with variable structure", Automat. Telemek. (USSR), vol. 24, pp. 327-333, 1963.