Course 1-12


  • DENIS ENĂCHESCU, University of Bucharest

    ELEMENTS OF STATISTICAL LEARNING. APPLICATIONS IN DATA MINING

    Lecture Notes

    1

  • 1 The Nature of Machine Learning

    1.1 Basic Definitions and Key Concepts

    Learning (here understood as artificial, automatic learning, i.e. Machine Learning). This concept covers any method that makes it possible to build a model of reality from data, either by improving a partial or less general model, or by creating the model from scratch. There are two principal tendencies in learning: the one coming from artificial intelligence, usually called symbolic, and the one coming from statistics, usually called numerical.

    Precision vs. generalization: the great dilemma of learning. Precision is defined as the difference between a measured or predicted value and the actual value. Learning with too much precision leads to "over-fitting", akin to learning by heart, in which unimportant details (or details induced by noise) are learned. Learning with too little precision leads to "over-generalization", and the model then applies even when the user does not wish it to.

    Intelligibility (sometimes called comprehensibility or understandability). For a few years, mainly under pressure from industry, researchers have also started to try to control the intelligibility of the model obtained by data mining. Until now, the methods for measuring intelligibility amount to checking that the results are expressed in the language of the user and that the size of the models is not excessive. Specific visualization methods are also used.

    The criterion of success. The criterion of success is what is measured in the performance evaluation. It is thus a criterion relative to an external observer. For example, the performance may be measured by the number of errors made by the learner in the course of learning, or by its error rate after learning. More generally, the measurement of performance can include factors independent of the fit to the learning data and of very diverse natures: for example, the simplicity of the result produced by the learning machine (LM), its comprehensibility, its intelligibility to an expert, how easily it can be integrated into a current theory, the low computational cost necessary to obtain it, etc.

    An important remark should be made here. The criterion of success, measured by an external observer, is not necessarily identical to the performance index or loss function that is internal to the LM and used in the internal evaluation of the learning model. For example, a learning algorithm for a connectionist network generally seeks to minimize a quadratic deviation between what it predicts on each learning example and the desired output.

    The learning protocol. Learning and its evaluation depend on the protocol that establishes the interactions between the LM and its environment, including the supervisor (the oracle). It is thus necessary to distinguish between batch learning, in which all the learning data are provided at the start, and on-line learning, in which the data arrive in sequence and the learner must deliberate and provide an answer after each entry or group of entries.

    The protocol also stipulates the type of entries provided to the learner and the type of expected outputs. For example, a scenario can specify that at every moment the LM receives an observation x_i, that it must provide an answer y_i, and that only then does the supervisor produce the correct answer u_i. One then naturally speaks of a prediction task. Moreover, the tasks known as prediction tasks are concerned with correctly predicting a response at a precise point.

    In contrast, in identification tasks the goal is to find a global explanation among all those possible, which, once known, makes it possible to make predictions whatever the question. The scenario is then different. For example, after each new entry (x_i, u_i), the learning system must provide a hypothesis on the "hidden function" of the supervisor by which the latter determines u_i as a function of x_i. One can see that the criterion of success is not the same in the case of a prediction task as in that of an identification task. In the latter case, indeed, one asks much more from the LM, since one expects from it an explicit hypothesis, and therefore a kind of explanation of its predictions.

    In addition, the LM can be more or less active. In the protocols described up to now, the LM receives the data passively, without having any influence on their selection. It is possible to consider scenarios in which the LM has a certain initiative in the search for information. In certain cases this initiative is limited, for example when the LM, without having total control over the choice of the learning sample, is simply able to direct its probability distribution; the boosting methods are an illustration of this case. In other cases, the LM can ask questions about the class membership of an observation, in which case one speaks of learning by membership queries, or even organize experiments on the world, in which case one speaks of active learning. The game of Mastermind, which consists in guessing a hidden configuration of colored pawns by asking questions according to certain rules, is a simple example of active learning in which the learner has the initiative of the questions.

    11

  • The task of learning. It is possible to approach the objective of the learning process from several points of view.

    The knowledge point of view. The goal of learning can be to modify the content of knowledge2. One then speaks of knowledge acquisition, of revision, and, why not, of forgetting.

    The goal of learning can also be, without necessarily modifying the "content" of knowledge, to make it more effective with respect to a certain goal, by reorganization, optimization or compilation, for example. This could be the case of a chess player or a mental calculator who learns to go faster and faster without learning new rules of play or of calculation. One speaks in this case of performance optimization (speed-up learning).

    2 Measured, for example, by its deductive closure, i.e., in a logical representation, all that can be deduced correctly starting from the current knowledge base.

    12

  • The environment point of view. The task of learning can also be defined with respect to what the learning agent must carry out to survive in its environment. That can include:

    Learning to recognize patterns (for example: handwritten characters, birds, predators, an upward trend of a stock on the exchange, appendicitis, etc.). When the learning is done with a professor, or supervisor, who provides the desired answers, one has supervised learning. If not, one speaks of unsupervised learning. In this last case, the task of learning consists at the same time in discovering categories and finding categorization rules.

    Learning to predict. There is then a notion of temporal dependence or causality.

    Learning to be more effective. This is the case in particular in situations of problem solving, or of search for action plans in the world.

    13

  • The abstract classes of problems point of view. Independently of the learning algorithm, it is possible to characterize the learning process by a general and abstract class of problems and resolution processes. Thus a certain number of disciplines, in particular ones originating in mathematics or information theory, have discovered an interest in learning problems.

    The theories of information compression. In a certain sense, learning can be approached as a problem of extraction and compression of information. It is a question of extracting the essential information, or the initial message of an ideal transmitter, cleared of all its redundancies. In a sense, the natural sciences, such as astronomy or ornithology, proceed by eliminating superfluous or redundant details and by describing hidden regularities.

    14

  • Cryptography. From a similar point of view, close to the goals of information theory, learning can be regarded as an attempt at decoding a message coded by the ideal transmitter and intercepted in whole or in part by the learning agent. After all, this is sometimes how the scientist studying nature proceeds. It is then logical to ask under which conditions a message can "be broken", i.e. under which conditions learning is possible.

    Mathematical / numerical analysis. Learning can also be examined as a problem of approximation. The task of learning is to find as good an approximation as possible of a hidden function known only through a sample of data. The learning problem then often becomes the study of the conditions of approximation and convergence.

    Induction. In the seventies and at the beginning of the eighties, under the influence of the cognitive point of view, a broad community of researchers, particularly active in France, approached learning as a problem of generalization. This approach starts from two essential hypotheses. First, the cognitive learning agent must learn something that another cognitive agent already knows in an equivalent form; it is thus normally able to reach the target knowledge perfectly. Second, knowledge and data can be described by a language. One then seeks the operators in this language that can correspond to operations of generalization or specialization useful for induction, and one builds algorithms using them, making it possible to summarize the data while avoiding over-fitting and the drawing of illegitimate consequences.

    Applied mathematics. Finally, the engineer can be tempted to see in learning a particular case of the resolution of an inverse problem. Let us take two examples:

    one can say that probability theory is a theory dealing with a direct problem (given a parameterized model, what are the probabilities associated with such an event?), while statistics attacks an inverse problem (given a sample of data, which model makes it possible to explain it, i.e. could have produced it?);

    given two numbers, it is easy to find their product (direct problem); it is on the other hand generally impossible to find, starting from a number, the numbers of which it is the product (inverse problem).

    Inverse problems are thus often problems said to be ill-posed, i.e. not having a single solution. According to this point of view, the study of learning can be seen as the study of the conditions making it possible to solve an ill-posed problem, i.e. of the constraints that must be added so that the resolution procedure can find a particular solution.

    The data structures or types of hypotheses considered

    17

  • It frequently happens that one imposes the type of structure (or the language of expression of the hypotheses) that must be sought by the learning system. This makes it possible to guide both the choice of the learning algorithm to be used and the data that will be necessary for learning to be possible. Without seeking to be exhaustive, we quote among the principal structures of data studied:

    - Boolean expressions, which are often suited to learning concepts defined on an attribute-value language (for example the rules of an expert system);
    - grammars and Markov processes, allowing sequences of events to be represented;
    - linear/nonlinear functions, making it possible to discriminate objects belonging to a subspace or to its complement;
    - decision trees, which allow classifications by hierarchies of questions; the corresponding decision tree is often at the same time concise and comprehensible;
    - logical programs, which allow relational concepts to be learned;
    - Bayesian networks, allowing at the same time to represent universes structured by causality relations, to take them into account, and to express measurements of certainty or confidence.

    Sometimes learning can consist in changing the structure of the data in order to find an equivalent but computationally more effective structure. It is once again, from another angle, the problem of performance optimization.

    19

  • To simplify, we will suppose that the LM seeks an approximation of the target function inside a family H of hypothesis functions. This is the case, for example, of learning with a neural network whose architecture constrains the realizable functions to a certain space of functions.

    We defined the task of learning as the problem of estimating a function from the observation of a sample of data. We now turn to the principles that allow this estimation to be carried out.

    The exploration of the hypothesis space. Let H be a hypothesis space, X a data space and S a training sample. The task of learning is to find a hypothesis h approximating as well as possible, in the sense of a certain performance measure, a target function f, based on the sample S = {(x_i, u_i)}_{i=1,...,m}, in which one supposes that each label u_i was computed by the function f applied to the data x_i.

    20

  • How to find such a hypothesis h ∈ H? Two questions arise:

    1. How to know that a satisfactory (even optimal) hypothesis has been found, and more generally how to evaluate the quality of a hypothesis?

    2. How to organize the search in H?

    Whatever the process guiding the exploration of H, the LM must be able to evaluate the hypothesis h that it considers at each moment of its search. We will see that this evaluation uses an internal performance index (for example a quadratic deviation between the outputs computed from h and the desired targets u provided in the training sample). It is this performance index, plus possibly other information provided by the environment (including the user, for example), which allows the LM to measure its performance on the training sample and to decide whether it must continue its search in H or whether it can stop.

    Supposing that at moment t the LM judges its current hypothesis h_t unsatisfactory, how can it change it? It is there that the effectiveness of the learning is decided, and in this context the structure of the space H plays an important role. The richer and finer this structure, the more it will be possible to organize the exploration of H effectively. Let us quickly examine three possibilities, in ascending order of structure:

    The hypothesis space H does not present any structure. In this case, only a random exploration is possible. Nothing makes it possible to guide the search, nor even to benefit from the information already gained on H. This is the case where nothing is known a priori on H.

    A notion of neighborhood is definable on H. It is then possible to operate an exploration by optimization techniques such as the gradient method. The advantage of these techniques, and what makes them so popular, is that they are of very general use, since it is often possible to define a notion of neighborhood on a space. A fundamental problem is that of the relevance of this notion: a bad neighborhood relation can indeed move the LM away from the promising areas of the space! In addition, it is still a weak structure which, except in particular cases (differentiability, convexity, etc. of the function to be optimized), does not allow a fast exploration.

    22

  • It is sometimes possible to have a stronger structure making it possible to organize the exploration of H. In this case, for example, it becomes possible to modify an erroneous hypothesis by specializing it just enough so that it no longer covers the new negative example, or, on the contrary, by generalizing it just enough so that it covers the new positive example provided. This type of exploration, possible in particular when the hypothesis space is structured by a language, is generally better guided and more effective than a blind exploration.

    From what precedes, it is obvious that the stronger the structure of the hypothesis space and the better it is adapted to the learning problem, the easier the learning will be. On the other hand, of course, this will require preliminary deliberation.

    23

  • 1.2 Short History

    Artificial learning is a young discipline at the common frontier of artificial intelligence and computer science, but it already has a history. We sketch it here rapidly, believing that it is always interesting to know the past of a discipline, because it can reveal, through the tensions it exposes, its major problems and its major options.

    The theoretical foundations of learning were laid with the first results in statistics in the years 1920 and 1930. These results seek to determine how to infer a model from data, but especially how to validate a hypothesis based on a sample of data. Fisher in particular studied the properties of linear models and how they can be derived from a sample of data. In the same period, computer science was born with the work of Gödel, Church and especially Turing in 1936, and the first computer simulations became possible after the Second World War.

    Besides the theoretical reflections and the conceptual debates on cybernetics and cognitivism, the pioneers of the domain tried to program machines to carry out intelligent tasks, often integrating learning. This is particularly the case of the first simulations of cybernetic tortoises or mice, which were placed in labyrinths in the hope of seeing them learn to get out more and more quickly. On his side, Samuel at IBM, in the years 1959-1962, developed a program to play American checkers, which included an evaluation function of the positions enabling it to quickly become a very good player.

    24

  • In the 1960s, learning was marked by two currents. On the one hand, a first connectionism, which, under the leadership of Rosenblatt, father of the perceptron, saw the development of small artificial neural networks tested on class-recognition tasks using supervised learning. On the other hand, conceptual tools for pattern recognition were developed. At the end of the 1960s, the publication of the book by Minsky and Papert (1969), which stated the limits of the perceptron, brought almost all research in this field to a halt for about fifteen years.

    In a concomitant manner, the accent put in artificial intelligence in the 1970s on knowledge, its representation and the use of sophisticated inference rules (the period of expert systems) encouraged work on learning systems based on structured knowledge representations, bringing into play complex inference rules such as generalization, analogy, etc.

    25

  • Figure 1-1 The first period of artificial learning

    26

  • This was then the triumph of impressive systems carrying out specific learning tasks by simulating, more or less, strategies used in human learning. One must cite the ARCH system of Winston in 1970, which learns to recognize arches in a blocks world from examples and counterexamples; the AM system of Lenat in 1976, which discovers conjectures in the field of arithmetic by using a set of heuristic rules; or the META-DENDRAL system of Mitchell, which learns rules in an expert system dedicated to the identification of chemical molecules.

    It was also a period during which the dialogue was easy and fertile between psychologists and experts in artificial learning. Hence hypotheses relating to concepts such as short-term and long-term memory, the procedural or declarative type of knowledge, etc., and also the ACT system of Anderson, testing general hypotheses on the learning of mathematical concepts in education.

    However spectacular they were, these systems had weaknesses stemming from their complexity. Indeed, their realization necessarily implies a great number of choices, small and large, often implicit, which therefore do not allow an easy replication of the experiments and, especially, cast doubt on the general and generic scope of the proposed principles. This is why the 1980s gradually saw work on such simulations dry up, with some brilliant exceptions such as the ACT or SOAR systems.

    Moreover, these years saw a very powerful comeback of connectionism in 1985, with in particular the discovery of a new learning algorithm based on gradient descent for multi-layer perceptrons. That deeply modified the study of artificial learning by opening wide the door to all the concepts and mathematical techniques relating to optimization and convergence properties. Parallel to this intrusion of continuous mathematics, other mathematicians rushed (following Valiant in 1984) into the breach opened by the concept of version space due to Mitchell.

    28

  • Figure 1-2 The second period of artificial learning.

    29

  • At a stroke, learning was no longer seen as the search for algorithms simulating a learning task, but as a process of eliminating hypotheses that do not satisfy an optimization criterion. The question, within this research framework, was how a sample of data drawn at random could make it possible to identify a good hypothesis in a given hypothesis space. This was extremely disorienting, and as the language used in this new research direction was rather distant from that of the experts in artificial learning, the latter continued to develop algorithms that were simpler but more general than those of the previous decade: decision trees, genetic algorithms, induction of logical programs, etc.

    It is only in the 1990s, and especially after 1995 and the publication of a small book by Vapnik (1995), that the statistical theory of learning truly influenced artificial learning, by giving a solid theoretical framework to the questions and empirical observations made in the practice of artificial learning.

    The current development of the discipline is dominated at the same time by a vigorous theoretical effort in the directions opened by Vapnik and the theorists of the statistical approach, and by a redeployment towards the application of the developed techniques to large applications of economic interest, such as the mining of socio-economic data, or of scientific interest, such as genomics. It is undeniable that for the moment learning is felt to be necessary in very many fields and that we live in a golden age for this discipline. That should not, however, make us forget the need to resume the dialogue with psychologists, teachers, and more generally all those who work on learning in one form or another.

    30

  • Figure 1-3 The third period of artificial learning.

    31

  • A non-exhaustive list of journals specialized in artificial learning:

    - Machine Learning Journal
    - Journal of Machine Learning Research (available free at http://www.ai.mit.edu/projects/jmlr/)
    - Journal of Artificial Intelligence Research (JAIR), accessible free on the Internet (http://www.ai.mit.edu/projects/jmlr/)
    - Data Mining and Knowledge Discovery Journal
    - Transactions on Knowledge and Data Engineering

    32

  • Table 1.1 - Core tasks for Machine Learning

    Task category          Specific tasks

    Classification         Classification, Theory revision, Characterization, Knowledge refinement, Prediction, Regression, Concept drift

    Heuristics             Learning heuristics, Learning in Planning, Learning in Scheduling, Learning in Design, Learning operators, Strategy learning, Utility problem, Learning in Problem solving, Knowledge compilation

    Discovery              Scientific knowledge discovery, Theory formation, Clustering

    Grammatical inference  Grammar inference, Automata learning, Learning programs

    Agents                 Learning agents, Multiagent system learning, Control, Learning in Robotics, Learning in perception, Skill acquisition, Active learning, Learning models of environment

    Theory                 Foundations, Theoretical issues, Evaluation issues, Comparisons, Complexity, Hypothesis selection

    Features/Languages     Feature selection, Discretization, Missing value handling, Parameter setting, Constructive induction, Abstraction, Bias issues

    Cognitive Modeling     Cognitive modeling

    33

  • The Information Society Technologies Advisory Group (ISTAG) has recently identified a set of grand research challenges for the preparation of FP7 (July 2004). Among these challenges are

    - The 100% safe car
    - A multilingual companion
    - A service robot companion
    - The self-monitoring and self-repairing computer
    - The internet police agent
    - A disease and treatment simulator
    - An augmented personal memory
    - A pervasive communication jacket
    - A personal everywhere visualiser
    - An ultra-light aerial transportation agent
    - The intelligent retail store

    If perceived from an application perspective, a multilingual companion, an internet police agent or a 100% safe car are vastly different things. Consequently, such systems are investigated in largely unconnected scientific disciplines and will be commercialized in various industrial sectors ranging from health care to automotive.

    34

  • Course 02 - The General Model of Supervised Learning

    1.1 General Model of Learning from Examples

    A learning problem is defined by the following components:

    1. A set of three actors:

    - The environment: it is supposed to be stationary and it generates data x_i drawn independently and identically distributed (an i.i.d. sample) according to a distribution D_X on the data space X.

    - The oracle, or supervisor, or professor, or Nature, who, for each x_i, returns a desired answer or label u_i in agreement with an unknown conditional probability distribution F(u | x).

    - The learner or learning machine (LM), able to realize a function h (not necessarily deterministic) belonging to a space of functions H, such that the output produced by the LM verifies y_i = h(x_i), for h ∈ H.

    1

    2. The learning task: the LM seeks in the space H a function h that approximates as well as possible the desired response of the supervisor. In the case of induction, the distance between the hypothesis function h and the response of the supervisor is defined by the mean loss over the possible situations in Z = X × U. Thus, for each entry x_i and response u_i of the supervisor, one measures the loss, or cost, l(u_i, h(x_i)), evaluating the cost of having taken the decision y_i = h(x_i) when the desired answer was u_i (one supposes, without loss of generality, that the loss is positive or null). The mean cost, or real risk, is then:

    R_real(h) = \int_Z l(u, h(x)) dF(x, u)

    It is a statistical measure that is a function of the functional dependence F(x, u) between the entries x and the desired outputs u. This dependence can be expressed by a joint probability density defined on X × U, which is unknown. In other words, it is a question of finding a hypothesis h close to f in the sense of the loss function, particularly in the frequently encountered areas of the space X. As these areas are not known a priori, it is necessary to use the training sample to estimate them, and the problem of induction is thus to seek to minimize the unknown real risk starting from the observation of the training sample S.
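    The distinction between the real risk and its empirical estimate can be made concrete with a small numerical experiment. The sketch below is illustrative and not from the notes: it assumes a synthetic joint distribution (a noisy sine target), the squared loss, and hypothetical helper names (make_sample, empirical_risk, real_risk); the real risk is approximated by Monte Carlo sampling.

```python
# Minimal sketch (not from the lecture notes): estimating the real risk of a
# hypothesis h by Monte Carlo under an assumed synthetic joint distribution
# F(x, u), and comparing it with the empirical risk on a small training sample.
import numpy as np

rng = np.random.default_rng(0)

def f_target(x):                 # hidden target function f (assumed for the demo)
    return np.sin(x)

def make_sample(m):              # draws an i.i.d. sample (x_i, u_i), u = f(x) + noise
    x = rng.uniform(-3.0, 3.0, size=m)
    u = f_target(x) + rng.normal(0.0, 0.1, size=m)
    return x, u

def loss(u, y):                  # squared loss l(u, h(x))
    return (u - y) ** 2

def h(x):                        # a candidate hypothesis (here: a crude linear fit)
    return 0.3 * x

def empirical_risk(h, x, u):     # R_emp(h) = (1/m) sum_i l(u_i, h(x_i))
    return np.mean(loss(u, h(x)))

def real_risk(h, n=200_000):     # Monte Carlo approximation of R_real(h)
    x, u = make_sample(n)
    return np.mean(loss(u, h(x)))

x_train, u_train = make_sample(30)
print("empirical risk on 30 points:", empirical_risk(h, x_train, u_train))
print("(approximate) real risk    :", real_risk(h))
```

    With only 30 training points the two quantities can differ noticeably; controlling this gap is precisely what the inductive principles discussed below are about.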

    2

    3. Finally, an inductive principle, which prescribes what the sought function h must satisfy, according at the same time to the notion of proximity evoked above and to the observed training sample S = {(x_1, u_1), ..., (x_m, u_m)}, with the aim of minimizing the real risk.

    The inductive principle dictates what the best hypothesis must satisfy according to the training sample, the loss function and, possibly, other criteria. It is an ideal objective. It should be distinguished from the learning method (or algorithm), which describes an effective realization of the inductive principle. For a given inductive principle, there are many learning methods, which result from different choices for solving the computational problems that are beyond the scope of the inductive principle. For example, the inductive principle can prescribe that it is necessary to choose the simplest hypothesis compatible with the training sample. The learning method must then specify how to actually seek this hypothesis, or a suboptimal hypothesis if necessary, while satisfying certain constraints such as the available computational resources. Thus, for example, the learning method may seek, by a gradient method, a sub-optimal but easily computable approximation of the optimum defined by the inductive principle.

    The definition given above is very general: in particular, it does not depend on the selected loss function. It has the merit of distinguishing the principal ingredients of a learning problem, which are often mixed together in descriptions of practical achievements.

    3

  • 1.1.1 The Theory of Inductive Inference

    The inductive principle prescribes which hypothesis one should choose in order to minimize the real risk based on the observation of a training sample. However, there is no unique or ideal inductive principle. How can one extract, from the data, a regularity that has a chance of being relevant for the future? A certain number of "reasonable" answers have been proposed. We describe the principal ones here in a qualitative way before re-examining them more formally in this and the next chapters.

    The choice of the hypothesis minimizing the empirical risk (Empirical Risk Minimization, or the ERM principle). The empirical risk is the average loss measured on the training sample S:

    R_emp(h) = (1/m) \sum_{i=1}^{m} l(u_i, h(x_i))

    The idea underlying this principle is that the hypothesis that best fits the data, supposing that the data are representative, is a hypothesis that describes the world correctly in general. The ERM principle has been, often implicitly, the principle used in artificial intelligence since the origin, in connectionism as well as in symbolic learning systems. What could be more natural, indeed, than to consider that a regularity observed on the known data will still be verified by the phenomenon that produced these data? It is, for example, the guiding principle of the perceptron algorithm as well as of the ARCH system. In these two cases, one seeks a hypothesis coherent with the examples, i.e. of null empirical risk. It is possible to refine the empirical risk minimization principle by choosing, among the optimal hypotheses, either one of the most specific or one of the most general.
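    The ERM principle can be illustrated with a toy finite hypothesis space. The sketch below is not from the notes: it assumes a family of threshold classifiers h_theta(x) = sign(x − theta), the 0-1 loss, and synthetic data; it simply returns the hypothesis of minimal empirical risk.

```python
# Sketch of the ERM principle on a toy problem (illustrative, not from the notes):
# choose, in a finite hypothesis space H of threshold classifiers, the hypothesis
# that minimizes the empirical risk R_emp(h) = (1/m) sum_i l(u_i, h(x_i)), 0-1 loss.
import numpy as np

rng = np.random.default_rng(1)

# training sample S = {(x_i, u_i)}: labels +1 above an unknown threshold 0.3, with noise
m = 40
x = rng.uniform(-1.0, 1.0, size=m)
u = np.where(x > 0.3, 1, -1)
u[rng.random(m) < 0.05] *= -1            # a little label noise

# finite hypothesis space H: h_theta(x) = sign(x - theta), theta on a grid
thetas = np.linspace(-1.0, 1.0, 201)

def empirical_risk(theta):
    predictions = np.where(x > theta, 1, -1)
    return np.mean(predictions != u)      # 0-1 loss

risks = np.array([empirical_risk(t) for t in thetas])
best = thetas[np.argmin(risks)]
print(f"ERM hypothesis: threshold = {best:.2f}, empirical risk = {risks.min():.3f}")
```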

    5

  • The choice of the most probable hypothesis given the training sample. This is the Bayesian decision principle. The idea here is that it is possible to define a probability distribution on the hypothesis space, and that the knowledge available prior to learning can be expressed in particular in the form of an a priori probability distribution on the hypothesis space. The learning sample is then regarded as information that modifies the probability distribution on H (see the figure below). One can then either choose the most probable a posteriori hypothesis (the Maximum A Posteriori, or MAP, principle), or adopt a composite hypothesis resulting from the average of the hypotheses weighted by their a posteriori probability (the true Bayesian approach).

    Figure: The hypothesis space H is presumed to be provided with an a priori probability density. Learning consists in modifying this density according to the learning examples.
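    For comparison with the ERM principle, the following sketch (illustrative, not from the notes) applies the Bayesian view to the same kind of finite hypothesis space of threshold classifiers: a uniform prior on H is updated into a posterior under an assumed label-noise likelihood, and one can then keep the MAP hypothesis or average the predictions over the posterior.

```python
# Illustrative sketch (not from the notes): Bayesian treatment of a finite
# hypothesis space of threshold classifiers. A prior over H is updated into a
# posterior given the sample; one may then keep the MAP hypothesis or average
# the predictions of all hypotheses weighted by their posterior probabilities.
import numpy as np

rng = np.random.default_rng(2)

m = 30
x = rng.uniform(-1.0, 1.0, size=m)
u = np.where(x > 0.2, 1, -1)                    # hidden threshold 0.2 (assumed)

thetas = np.linspace(-1.0, 1.0, 101)            # finite hypothesis space H
prior = np.full(len(thetas), 1.0 / len(thetas)) # uniform a priori distribution

# likelihood of the sample under h_theta, assuming each label is flipped with prob. 0.1
eps = 0.1
def likelihood(theta):
    pred = np.where(x > theta, 1, -1)
    agree = pred == u
    return np.prod(np.where(agree, 1.0 - eps, eps))

post = prior * np.array([likelihood(t) for t in thetas])
post /= post.sum()                              # a posteriori distribution on H

theta_map = thetas[np.argmax(post)]             # MAP hypothesis
x_new = 0.25
bayes_pred = np.sum(post * np.where(x_new > thetas, 1, -1))  # posterior-weighted vote
print(f"MAP threshold: {theta_map:.2f}, Bayesian averaged prediction at x=0.25: {bayes_pred:+.2f}")
```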

  • The choice of a hypothesis that compresses the information contained in the training sample as well as possible. We will call this precept the information compression principle. The idea is to eliminate the redundancies present in the data in order to extract the underlying regularities that allow an economical description of the world. It is implied that the regularities discovered in the data are valid beyond the data and apply to the whole world.

    The question is to know whether these intuitively tempting ideas make it possible to learn effectively. More precisely, we would like to obtain answers to a certain number of naive questions:

    - does the application of the selected inductive principle indeed minimize the real risk? What conditions should be satisfied for that? Moreover, must these conditions be verified on the training sample, on the target functions, by the supervisor, or on the hypothesis space?

    - how does the performance in generalization depend on the information contained in the training sample, on its size, etc.?

    - what maximum performance is possible for a given learning problem?

    - what is the best LM for a given learning problem?

    Answering these questions implies choices that depend partly on the type of inductive principle used. That is why we gave a brief description of them above.

    8

  • 1.1.2 How to Analyze the Learning?

    We described learning, at least inductive learning, as a problem of optimization: to seek the best hypothesis in the sense of the minimization of the mean risk on a training sample. We now want to study under which conditions the resolution of such a problem is possible. We also want tools allowing us to judge the performance of an inductive principle or of a learning algorithm. This analysis requires additional assumptions, which correspond to options on what is expected from the LM.

    Thus, a learning problem depends on the environment, which generates data x_i according to a certain unknown distribution D_X, on the supervisor, which chooses a target function f, and on the selected loss function l. The performance of the LM (which depends on the selected inductive principle and on the learning algorithm implementing it) will be evaluated according to the choices of each of these parameters. When we seek to determine the expected performance of the LM, we must thus discuss the origin of these parameters. There are in particular three possibilities:

    1. It is supposed that one knows nothing a priori about the environment, and therefore neither about the distribution of the learning data nor about the target dependence, but one wants to guard against the worst possible situations, as if the environment and the supervisor were adversaries. One then seeks to characterize the performance of learning in the worst possible situations, which is generally expressed in terms of bounds on the risk. This is the worst-case analysis. One also speaks of the Min Max analysis framework, by reference to game theory. The advantage of this point of view is that the possible performance guarantees will be independent of the environment (the real risk being bounded whatever the distribution of the events) and of the supervisor or Nature (i.e. whatever the target function). On the other hand, the conditions identified to obtain such guarantees will be so strong that they will often be very far away from real learning situations.

    2. One can, on the contrary, want to measure an average performance. In this case, it should be supposed that there is a distribution D_X on the learning data, but also a distribution D_F on the possible target functions. The resulting analysis is the average-case analysis. One also speaks of the Bayesian framework. This analysis allows in theory a finer characterization of the performance, at the price, however, of having to make a priori assumptions on the spaces X and F. Unfortunately, it is often very difficult to obtain analytically the conditions guaranteeing successful learning, and it is generally necessary to use approximation methods, which removes part of the interest of such an approach.

    3. Finally, one could seek to characterize the most favorable case, when the environment and the supervisor are benevolent and want to help the LM. But it is difficult to determine the border between benevolence, that of a professor for example, and collusion, which would see the supervisor acting like an accomplice and coding the target function in a code known to the learner; this would no longer be learning, but an illicit transmission. This is why this type of analysis, though interesting, does not yet have a well-established framework.

    10

  • 1.1.3 Validity Conditions for the ERM Principle

    In this section, we concentrate on the analysis of the ERM inductive principle, which prescribes choosing the hypothesis minimizing the empirical risk measured on the learning sample. It is indeed the most employed rule, and its analysis leads to very general conceptual principles. The ERM principle was initially the subject of a worst-case analysis, which we describe here. An average-case analysis, using ideas from statistical physics, has also been the object of many very interesting works; it is however technically much more difficult.

    Let us recall that learning consists in seeking a hypothesis h that minimizes the average loss. Formally, it is a question of finding an optimal hypothesis h* minimizing the real risk:

    h* = ArgMin_{h ∈ H} R_real(h)

    The problem is that one does not know the real risk attached to each hypothesis h. The natural idea is thus to select the hypothesis h in H that behaves well on the learning data S: this is the ERM inductive principle. We will denote by h_S this optimal hypothesis for the empirical risk measured on the sample S:

    h_S = ArgMin_{h ∈ H} R_emp(h)

    11

  • This inductive principle will be relevant only if the empirical risk is correlated with the real risk. Its analysis must thus study the correlation between the two risks, and more particularly the correlation between the real risk incurred by the hypothesis selected using the ERM principle, R_real(h_S), and the optimal real risk R_real(h*). This correlation involves two aspects:

    1. The difference (necessarily positive or null) between the real risk of the hypothesis h_S selected using the training sample S and the real risk of the optimal hypothesis h*: R_real(h_S) − R_real(h*).

    2. The probability that this difference is higher than a given bound ε. Given that the empirical risk depends on the training sample, the correlation between the measured empirical risk and the real risk depends on the representativeness of this sample. This is also why, when the difference R_real(h_S) − R_real(h*) is studied, it is necessary to take into account the probability of the training sample given a certain target function. One cannot be a good learner in all situations, but only in the reasonable ones (representative training samples), which are the most probable.

    12

  • Thus, let us take up again the question of the correlation between the empirical risk and the real risk. The ERM principle is a valid inductive principle if the real risk computed with the hypothesis h_S that minimizes the empirical risk is guaranteed to be close to the optimal real risk obtained with the optimal hypothesis h*. This closeness must hold in the large majority of the situations that can occur, i.e. for the majority of the learning samples drawn at random according to the distribution D_X.

    In a more formal way, one seeks under which conditions it would be possible to ensure:

    ∀ε > 0, ∀δ ≤ 1:  P( R_real(h_S) − R_real(h*) < ε ) ≥ 1 − δ        (1)

    z_i^T w > k, for i = 1, 2, ..., m. Furthermore, when ρ is fixed at a positive constant, this learning rule converges in finite time.

    2

  • Another variant of the perceptron learning rule is given by the batch update procedure

    w^1 arbitrary
    w^{k+1} = w^k + ρ \sum_{z ∈ Z(w^k)} z        (2.18)

    where Z(w^k) is the set of patterns z misclassified by w^k. Here, the weight vector change Δw^k = w^{k+1} − w^k is along the direction of the resultant vector of all misclassified patterns. In general, this update procedure converges faster than the perceptron rule, but it requires more storage.
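    A minimal sketch of the batch perceptron update (2.18), assuming normalized patterns (class-2 inputs negated and a bias component appended); the data and names used here are illustrative, not from the notes.

```python
# Sketch of the batch perceptron rule of Equation (2.18) (illustrative code, not
# from the notes). Patterns z are the class-1 inputs and the negated class-2
# inputs (augmented with a bias component), so a pattern is misclassified when
# z^T w <= 0.
import numpy as np

rng = np.random.default_rng(3)

# linearly separable toy data: two Gaussian clouds in the plane
x1 = rng.normal([2.0, 2.0], 0.5, size=(20, 2))    # class 1
x2 = rng.normal([-2.0, -2.0], 0.5, size=(20, 2))  # class 2

def augment(x):                                   # append the bias input 1
    return np.hstack([x, np.ones((len(x), 1))])

z = np.vstack([augment(x1), -augment(x2)])        # normalized patterns

rho = 0.1
w = np.zeros(3)                                   # w^1 (here: the zero vector)
for k in range(1000):
    misclassified = z[z @ w <= 0]                 # Z(w^k)
    if len(misclassified) == 0:
        print(f"converged after {k} batch updates, w = {w}")
        break
    w = w + rho * misclassified.sum(axis=0)       # Equation (2.18)
```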

    In the nonlinearly separable case, the preceding algorithms do not converge. Few theoretical results are available on the behavior of these algorithms for nonlinearly separable problems [see Minsky and Papert (1969) for some preliminary results]. For example, it is known that the length of w in the perceptron rule is bounded, i.e., tends to fluctuate near some limiting value w*. This information may be used to terminate the search for w*. Another approach is to average the weight vectors near the fluctuation point w*. Butz (1967) proposed the use of a reinforcement factor γ, 0 ≤ γ ≤ 1, in the perceptron learning rule. This reinforcement places w in a region that tends to minimize the probability of error for nonlinearly separable cases. Butz's rule is as follows:

    w^1 arbitrary
    w^{k+1} = w^k + ρ z^k          if (z^k)^T w^k ≤ 0
    w^{k+1} = w^k + γ ρ z^k        if (z^k)^T w^k > 0        (2.19)
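    The following sketch illustrates Butz's rule as reconstructed in Equation (2.19) above; the rule's exact form should be checked against Butz (1967), and the data and parameter values here are synthetic and illustrative.

```python
# Illustrative sketch of Butz's rule as reconstructed in Equation (2.19) above:
# an ordinary perceptron step on misclassified patterns, plus a small
# gamma-scaled reinforcement step on correctly classified ones.
import numpy as np

rng = np.random.default_rng(4)

# a NON linearly separable toy set of normalized, augmented patterns z
x1 = rng.normal([1.0, 1.0], 1.5, size=(30, 2))
x2 = rng.normal([-1.0, -1.0], 1.5, size=(30, 2))
z = np.vstack([np.hstack([x1, np.ones((30, 1))]),
               -np.hstack([x2, np.ones((30, 1))])])

rho, gamma = 0.05, 0.1
w = np.zeros(3)
for epoch in range(50):
    for zk in z:
        if zk @ w <= 0:
            w = w + rho * zk            # error-correction step
        else:
            w = w + gamma * rho * zk    # reinforcement step (0 <= gamma <= 1)

errors = np.sum(z @ w <= 0)
print(f"misclassified patterns after training: {errors} / {len(z)}")
```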

    2.1.2 The Perceptron Criterion Function

    It is interesting to see how the preceding error-correction rules can be derived by a gradient descent on an appropriate criterion (objective) function. For the perceptron, we may define the following criterion function (Duda and Hart, 1973):

    J(w) = \sum_{z ∈ Z(w)} ( − z^T w )        (2.20)

    where Z(w) is the set of samples misclassified by w (i.e., z^T w ≤ 0). Note that if Z(w) is empty, then J(w) = 0; otherwise, J(w) > 0. Geometrically, J(w) is proportional to the sum of the distances from the misclassified samples to the decision boundary. The smaller J(w) is, the better the weight vector w will be.

    Given this objective function J(w), the search point w^k can be incrementally improved at each iteration by sliding downhill on the surface defined by J(w) in w space. Specifically, we may use J to perform a discrete gradient-descent search that updates w^k so that a step is taken downhill in the "steepest" direction along the search surface J(w) at w^k. This can be achieved by making Δw^k proportional to the gradient of J at the present location w^k; formally, we may write1

    w^{k+1} = w^k + Δw^k = w^k − ρ ∇J(w)|_{w = w^k},   with   ∇J(w) = [ ∂J/∂w_1  ∂J/∂w_2  ...  ∂J/∂w_{n+1} ]^T        (2.21)

    Here, the initial search point w^1 and the learning rate (step size) ρ are to be specified by the user. Equation (2.21) can be called the steepest gradient-descent search rule or, simply, gradient descent. Next, substituting the gradient

    1 Discrete gradient-search methods are generally governed by the following equation: w^{k+1} = w^k − ρ A ∇J|_{w = w^k}. Here, A is an n×n matrix and ρ is a real number, both functions of w^k. Numerous versions of gradient-search methods exist, and they differ in the way A and ρ are selected at w = w^k. For example, if A is taken to be the identity matrix and ρ is set to a small positive constant, the gradient "descent" search in Equation (2.21) is obtained. On the other hand, if ρ is a small negative constant, gradient "ascent" search is realized, which seeks a local maximum. In either case, though, a saddle point (nonstable equilibrium) may be reached. However, the existence of noise in practical systems prevents convergence to such nonstable equilibria. It should also be noted that, in addition to its simple structure, Equation (2.21) implements "steepest" descent: it can be shown that, starting at a point w, the gradient direction ∇J(w) yields the greatest incremental increase of J(w) for a fixed incremental distance ||Δw|| = ||w' − w||. The speed of convergence of steepest descent search is affected by the choice of ρ, which is normally adjusted at each time step to make the most error correction subject to stability constraints. Finally, it should be pointed out that setting A equal to the inverse of the Hessian matrix [∇²J]^{-1} and ρ to 1 results in the well-known Newton's search method.

    6

  • ∇J(w)|_{w = w^k} = − \sum_{z ∈ Z(w^k)} z        (2.22)

    into Equation (2.21) leads to the weight update rule

    w^{k+1} = w^k + ρ \sum_{z ∈ Z(w^k)} z        (2.23)

    The learning rule given in Equation (2.23) is identical to the multiple-sample (batch) perceptron rule of Equation (2.18). The original perceptron learning rule of Equation (2.3) can be thought of as an "incremental" gradient-descent search rule for minimizing the perceptron criterion function in Equation (2.20). Following a procedure similar to that in Equations (2.21) through (2.23), it can be shown that

    J(w) = \sum_{z : z^T w ≤ b} ( b − z^T w )        (2.24)

    is the appropriate criterion function for the modified perceptron rule in Equation (2.16).

    7

  • Before moving on, it should be noted that the gradient of J in Equation (2.22) is not mathematically precise. Owing to the piecewise linear nature of J, sudden changes in the gradient of J occur every time the perceptron output y goes through a transition at (z^k)^T w = 0. Therefore, the gradient of J is not defined at "transition" points w satisfying (z^k)^T w = 0, k = 1, 2, ..., m. However, because of the discrete nature of Equation (2.21), the likelihood of w^k overlapping with one of these transition points is negligible, and thus we may still express ∇J as in Equation (2.22).
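    As a quick numerical check (illustrative, not from the notes), the next sketch evaluates the perceptron criterion (2.20), its gradient (2.22), and verifies that one gradient-descent step (2.21) coincides with the batch update (2.18)/(2.23).

```python
# Small numerical check (illustrative, not from the notes): the perceptron
# criterion J(w) of Equation (2.20), its (sub)gradient from Equation (2.22),
# and the fact that one gradient-descent step of Equation (2.21) is exactly the
# batch perceptron update of Equations (2.18)/(2.23).
import numpy as np

rng = np.random.default_rng(5)
z = rng.normal(size=(15, 3))        # some augmented, normalized patterns z
w = rng.normal(size=3)
rho = 0.1

def J(w):                           # Equation (2.20)
    s = z @ w
    return np.sum(-s[s <= 0])

def grad_J(w):                      # Equation (2.22)
    s = z @ w
    return -z[s <= 0].sum(axis=0)

w_gd = w - rho * grad_J(w)                       # gradient-descent step (2.21)
w_batch = w + rho * z[z @ w <= 0].sum(axis=0)    # batch perceptron step (2.18)
print("J(w) before:", J(w), " after:", J(w_gd))
print("identical updates:", np.allclose(w_gd, w_batch))
```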

    2.1.3 Mays' Learning Rule

    The criterion functions in Equations (2.20) and (2.24) are by no means the only functions that are minimized when w is a solution vector. For example, an alternative function is the quadratic function

    J(w) = (1/2) \sum_{z : z^T w ≤ b} ( z^T w − b )^2        (2.25)

    8

  • where b is a positive constant margin. Like the previous criterion functions, the function J(w) in Equation (2.25) focuses attention on the misclassified samples. Its major difference is that its gradient is continuous, whereas the gradient of the perceptron criterion function, with or without the use of a margin, is not. Unfortunately, the present function can be dominated by the input vectors with the largest magnitudes. We may eliminate this undesirable effect by dividing by ||z||^2:

    J(w) = (1/2) \sum_{z : z^T w ≤ b} ( z^T w − b )^2 / ||z||^2        (2.26)

    The gradient of J(w) in Equation (2.26) is given by

    ∇J(w) = \sum_{z : z^T w ≤ b} ( z^T w − b ) z / ||z||^2        (2.27)

    which, upon substituting into Equation (2.21), leads to the following learning rule

    9

  • w^1 arbitrary
    w^{k+1} = w^k + ρ \sum_{z : z^T w^k ≤ b} ( b − z^T w^k ) z / ||z||^2        (2.28)

    If we consider the incremental update version of Equation (2.28), we arrive at Mays' rule (Mays, 1964):

    w^1 arbitrary
    w^{k+1} = w^k + ρ ( b − (z^k)^T w^k ) z^k / ||z^k||^2      if (z^k)^T w^k ≤ b
    w^{k+1} = w^k                                              otherwise        (2.29)

    If the training set is linearly separable, Mays' rule converges in a finite number of iterations, for 0 < ρ < 2 (Duda and Hart, 1973). In the case of a nonlinearly separable training set, the training procedure in Equation (2.29) will never converge. To fix this problem, a decreasing learning rate such as ρ^k = ρ/k may be used to force convergence to some approximate separating surface.
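    A minimal sketch of Mays' incremental rule (2.29) on synthetic separable data (illustrative, not from the notes); the margin b and the learning rate rho are chosen arbitrarily within the stated ranges.

```python
# Sketch of Mays' incremental rule, Equation (2.29) (illustrative, not from the
# notes): an error-correction step is taken only when the current pattern does
# not clear the margin b, and the step is normalized by the pattern's length.
import numpy as np

rng = np.random.default_rng(6)

x1 = rng.normal([2.0, 2.0], 0.6, size=(25, 2))
x2 = rng.normal([-2.0, -2.0], 0.6, size=(25, 2))
z = np.vstack([np.hstack([x1, np.ones((25, 1))]),
               -np.hstack([x2, np.ones((25, 1))])])   # normalized, augmented patterns

rho, b = 0.5, 1.0          # 0 < rho < 2, positive margin b
w = np.zeros(3)
for epoch in range(100):
    updated = False
    for zk in z:
        activation = zk @ w
        if activation <= b:                                   # margin violated
            w = w + rho * (b - activation) * zk / (zk @ zk)   # Equation (2.29)
            updated = True
    if not updated:        # every pattern satisfies z^T w > b: done
        break

print(f"epochs used: {epoch + 1}, margin violations left: {np.sum(z @ w <= b)}")
```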

  • Widrow-Hoff (α-LMS) Learning Rule

    Another example of an error-correcting rule with a quadratic criterion function is the Widrow-Hoff rule (Widrow and Hoff, 1960). This rule was originally used to train the linear unit, also known as the adaptive linear combiner element (ADALINE), shown in Figure 2-3. In this case, the output of the linear unit in response to the input x^k is simply y^k = (x^k)^T w. The Widrow-Hoff rule was proposed originally as an ad hoc rule embodying the so-called minimal disturbance principle. Later, it was discovered (Widrow and Stearns, 1985) that this rule converges in the mean square to the solution w* that corresponds to the least-mean-square (LMS) output error, if all

    1

    Course 07 - The Widrow-Hoff Algorithm for Perceptron Training

  • Figure 2-3 Adaptive linear combiner element (ADALINE).

    2

  • input patterns are of the same length (i.e., ||x^k|| is the same for all k). Therefore, this rule is sometimes referred to as the α-LMS rule (the α is used here to distinguish this rule from another, very similar rule that is discussed in the next section). The α-LMS rule is given by

    w^1 = 0 or arbitrary
    w^{k+1} = w^k + α ( d^k − y^k ) x^k / ||x^k||^2        (2.30)

    where d^k ∈ R is the desired response, and α > 0. Equation (2.30) is similar to the perceptron rule if one sets ρ in Equation (2.2) as

    ρ = α / ||x^k||^2        (2.31)

    However, the error in Equation (2.30) is measured at the linear output, not after the nonlinearity, as in the perceptron. The constant α controls the stability and speed of convergence (Widrow and Stearns, 1985; Widrow and Lehr, 1990). If the input vectors are independent over time, stability is ensured for 0 < α < 2.
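    A minimal sketch of the α-LMS rule (2.30) on a synthetic linear regression problem (illustrative, not from the notes); note the normalization of each correction by ||x^k||^2.

```python
# Sketch of the alpha-LMS (Widrow-Hoff) rule of Equation (2.30) (illustrative,
# not from the notes): the error d^k - y^k is measured at the linear output and
# the correction is normalized by ||x^k||^2.
import numpy as np

rng = np.random.default_rng(7)

# synthetic linear regression data: d = x^T w_true + small noise
w_true = np.array([1.5, -2.0, 0.5])
x = rng.normal(size=(500, 3))
d = x @ w_true + rng.normal(0.0, 0.01, size=500)

alpha = 0.5                       # 0 < alpha < 2 for stability
w = np.zeros(3)                   # w^1 = 0
for xk, dk in zip(x, d):
    yk = xk @ w                   # linear output of the ADALINE
    w = w + alpha * (dk - yk) * xk / (xk @ xk)   # Equation (2.30)

print("estimated w:", np.round(w, 3), " true w:", w_true)
```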

  • μ-LMS Learning Rule

    The μ-LMS learning rule (Widrow and Hoff, 1960) is the most analyzed and most applied simple learning rule. It is also of special importance due to its possible extension to learning in multiple-unit neural nets. Therefore, special attention is given to this rule in this chapter. In the following, the μ-LMS rule is described in the context of the linear unit in Figure 2-3. Let

    J(w) = (1/2) \sum_{i=1}^{m} ( d_i − y_i )^2        (2.32)

    be the sum of squared error (SSE) criterion function, where

    y_i = (x^i)^T w        (2.33)

    Now, using steepest gradient-descent search to minimize J(w) in Equation (2.32) gives

    5

  • w^{k+1} = w^k − μ ∇J(w)|_{w = w^k} = w^k + μ \sum_{i=1}^{m} ( d_i − y_i ) x^i        (2.34)

    The criterion function J(w) in Equation (2.32) is quadratic in the weights because of the linear relation between y_i and w. In fact, J(w) defines a convex1 hyperparaboloidal surface with a single minimum w* (the global minimum). Therefore, if the positive constant μ is chosen sufficiently small, the gradient-descent search implemented by Equation (2.34) will asymptotically converge toward the solution w*, regardless of the setting of the initial search point w^1. The learning rule in Equation (2.34) is sometimes referred to as the batch LMS rule.

    1 A function f: R^n → R is said to be convex if the following condition is satisfied: f( λu + (1 − λ)v ) ≤ λ f(u) + (1 − λ) f(v), for any pair of vectors u and v in R^n and any real number λ in the closed interval [0, 1].

    The incremental version of Equation (2.34), known as the μ-LMS or LMS rule, is given by

    w^1 = 0 or arbitrary
    w^{k+1} = w^k + μ ( d^k − y^k ) x^k        (2.35)

    Note that this rule becomes identical to the α-LMS learning rule in Equation (2.30) upon setting μ as

    μ = α / ||x^k||^2        (2.36)
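    The batch rule (2.34) and the incremental rule (2.35) can be compared directly on the same synthetic data; the sketch below is illustrative and not from the notes.

```python
# Illustrative sketch (not from the notes) of the batch LMS rule (2.34) and the
# incremental mu-LMS rule (2.35) minimizing the same SSE criterion (2.32).
import numpy as np

rng = np.random.default_rng(8)

w_true = np.array([0.8, -1.2])
X = rng.normal(size=(50, 2))                  # rows are the inputs x^i
d = X @ w_true + rng.normal(0.0, 0.05, 50)    # desired responses d_i

mu = 0.01

# batch LMS, Equation (2.34): one update uses the full gradient of the SSE
w_batch = np.zeros(2)
for k in range(200):
    y = X @ w_batch
    w_batch = w_batch + mu * (d - y) @ X

# incremental mu-LMS, Equation (2.35): one update per presented pattern
w_inc = np.zeros(2)
for k in range(200):
    for xi, di in zip(X, d):
        w_inc = w_inc + mu * (di - xi @ w_inc) * xi

print("batch LMS      :", np.round(w_batch, 3))
print("incremental LMS:", np.round(w_inc, 3))
print("true weights   :", w_true)
```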

    Also, when the input vectors have the same length, as would be the case when x ∈ {+1, −1}^n, the μ-LMS rule becomes identical to the α-LMS rule. Since the α-LMS learning algorithm converges when 0 < α < 2, we can start from Equation (2.36) and calculate the required range on μ ensuring the convergence of the μ-LMS rule for "most practical purposes":

    0 < μ < 2 / max_i ||x^i||^2

    In general, there is no w satisfying d^k = (x^k)^T w for k = 1, 2, ..., m; therefore, Equation (2.35) never converges for a constant step size. Thus, for convergence, μ^k is set to μ_0 / k, where μ_0 > 0 is a small positive constant. In applications such as linear filtering, though, the decreasing step size is not very valuable, because it cannot accommodate nonstationarity in the input signal. Indeed, w^k will essentially stop changing for large k, which precludes the tracking of time variations. Thus the fixed-increment (constant μ) LMS learning rule has the advantage of limited memory, which enables it to track time fluctuations in the input data.

    When the learning rate μ is sufficiently small, the μ-LMS rule becomes a "good" approximation to the gradient-descent rule in Equation (2.34). This means that the weight vector w^k will tend to move toward the global minimum w* of the convex SSE criterion function. Next, we show that w* is given by

    w* = X† d        (2.38)

  • where X† = ( X X^T )^{-1} X, with X = [ x^1 x^2 ... x^m ] and d = [ d_1 d_2 ... d_m ]^T, is the generalized inverse or pseudoinverse (Penrose, 1955) of X for m > n + 1.

    The extreme points (minima and maxima) of the function J(w) are solutions of the equation

    ∇J(w) = 0        (2.39)

    Therefore, any minimum of the SSE criterion function in Equation (2.32) must satisfy

    ∇J(w) = − \sum_{i=1}^{m} ( d_i − (x^i)^T w ) x^i = X X^T w − X d = 0        (2.40)

    Equation (2.40) can be rewritten as

    X X^T w = X d        (2.41)

    which, for a nonsingular matrix X X^T, gives the solution in Equation (2.38), or explicitly

    w* = ( X X^T )^{-1} X d        (2.42)

    Recall that just because w* in Equation (2.42) satisfies the condition ∇J(w*) = 0, this does not guarantee that w* is a local minimum of the criterion function J. It does, however, considerably narrow the choices, in that such a w* represents (in a local sense) either a minimum, a maximum, or a saddle point of J. To verify that w* is actually a minimum of J(w), we may evaluate the second derivative or Hessian matrix

    ∇²J = [ ∂²J / ∂w_i ∂w_j ]

    of J at w* and show that it is positive definite2. But this result follows immediately after noting that ∇²J is equal to the positive-definite3 matrix X X^T. Thus w* is a minimum of J.

    2 An n×n real symmetric matrix A is positive definite if the quadratic form x^T A x is strictly positive for all nonzero column vectors x in R^n.

    3 Of course, the same result could have been achieved by noting that the convex, unconstrained quadratic nature of J(w) admits one extreme point w*, which must be the global minimum of J(w).

  • The LMS rule may also be applied to synthesize the weight vector w of a perceptron for solving two-class classification problems. Here, one starts by training the linear unit in Figure 2-3 with the given training pairs {x^k, d^k}, k = 1, 2, ..., m, using the LMS rule. During training, the desired target d^k is set to +1 for one class and to −1 for the other class. (In fact, any positive constant can be used as the target for one class, and any negative constant can be used as the target for the other class.) After convergence of the learning process, the solution vector obtained may be used in the perceptron for classification. Because of the thresholding nonlinearity in the perceptron, the output of the classifier will now be properly restricted to the set {+1, −1}.

    When used as a perceptron weight vector, the minimum SSE solution in Equation (2.42) does not generally minimize the perceptron classification error rate. This should not be surprising, since the SSE criterion function is not designed to constrain its minimum inside the linearly separable solution region. Therefore, this solution does not necessarily represent a linearly separable solution, even when the training set is linearly separable (this is further explored above). However, when the training set is nonlinearly separable, the solution arrived at may still be a useful approximation. Therefore, by employing the LMS rule for perceptron training, linear separability is sacrificed for good compromise performance on both separable and nonseparable problems.

Example 2.1 This example presents the results of a set of simulations that should help give some insight into the dynamics of the batch and incremental LMS learning rules. Specifically, we are interested in comparing the convergence behavior of the discrete-time dynamical systems in Equations (2.34) and (2.35). Consider the training set depicted in Figure 2-4 for a simple mapping problem. The 10 squares and 10 filled circles in this figure are positioned at the points whose (x_1, x_2) coordinates specify the two components of the input vectors. The squares and circles are to be mapped to the targets +1 and -1, respectively. For example, the left-most square in the figure represents the training pair {[1, 0]^T, 1}. Similarly, the right-most circle represents the training pair {[2, 2]^T, -1}.

Figure 2-4 A 20-sample training set used in the simulations associated with Example 2.1. Points signified by a square and a filled circle should map into +1 and -1, respectively.

Figure 2-5 shows plots of the evolution of the square of the distance between the vector w^k and the (computed) minimum SSE solution w^* for batch LMS (dashed line) and incremental LMS (solid line). In both simulations, the learning rate (step size) was set to 0.005. The initial search point w^1 was set to [0, 0]^T. For the incremental LMS rule, the training examples are selected randomly from the training set. The batch LMS rule converges to the optimal solution w^* in fewer than 100 steps. Incremental LMS requires more learning steps, on the order of 2000 steps, to converge to a small neighborhood of w^*.

Figure 2-5 Plots (learning curves) of the square of the distance between the search point w^k and the minimum SSE solution w^*, generated using two versions of the LMS learning rule. The dashed line corresponds to the batch LMS rule in Equation (2.34). The solid line corresponds to the incremental LMS rule in Equation (2.35) with a random order of presentation of the training patterns. In both cases, w^1 = 0 and a learning rate of 0.005 are used. Note the logarithmic scale for the iteration number k.

The fluctuations in \| w^k - w^* \|^2 in this neighborhood are less than 0.02, as can be seen from Figure 2-5. The effect of a deterministic order of presentation of the training examples on the incremental LMS rule is shown by the solid line in Figure 2-6. Here, the training examples are presented in a predefined order, which did not change during training. The same initialization and step size are used as before. In order to allow for a more meaningful comparison between the two LMS rule versions, one learning step of incremental LMS is taken to mean a full cycle through the 20 samples. For comparison, the simulation result with batch LMS learning is plotted in the figure (dashed line). These results indicate a very similar behavior in the convergence characteristics of incremental and batch LMS learning. This is so because of the small step size used. Both cases show asymptotic convergence toward the optimal solution w^*, but with a relatively faster convergence of the batch LMS rule near w^*. This is attributed to its use of more accurate gradient information.

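The qualitative behavior reported in Example 2.1 can be reproduced with a few lines of Python/NumPy. The data, step size, and iteration counts below are illustrative assumptions rather than the exact values behind Figures 2-5 and 2-6:

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative 20-sample training set: rows of X are input vectors, d holds +/-1 targets.
    X = rng.uniform(0.0, 2.0, size=(20, 2))
    d = np.where(np.arange(20) < 10, 1.0, -1.0)

    w_star, *_ = np.linalg.lstsq(X, d, rcond=None)   # minimum-SSE reference solution
    rho = 0.005

    # Batch LMS, Equation (2.34): one update per pass, using the full gradient.
    w_batch = np.zeros(2)
    for k in range(2000):
        w_batch += rho * X.T @ (d - X @ w_batch)

    # Incremental (mu-)LMS, Equation (2.35): one update per randomly drawn example.
    w_incr = np.zeros(2)
    for k in range(2000):
        i = rng.integers(len(d))
        w_incr += rho * (d[i] - X[i] @ w_incr) * X[i]

    # Squared distance to w* for each rule, as plotted in Figures 2-5 and 2-6.
    print(np.sum((w_batch - w_star) ** 2), np.sum((w_incr - w_star) ** 2))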

Figure 2-6 Learning curves for the batch LMS (dashed line) and incremental LMS (solid line) learning rules for the data in Figure 2-4. The result for the batch LMS rule shown here is identical to the one shown in Figure 2-5 (it looks different only because of the present use of a linear scale for the horizontal axis). The incremental LMS results shown assume a deterministic, fixed order of presentation of the training patterns. Also, for the incremental LMS case, w^k represents the weight vector after the completion of the kth learning "cycle." Here, one cycle corresponds to 20 consecutive learning iterations.

The μ-LMS Rule as a Stochastic Process

Stochastic approximation theory may be employed as an alternative to the deterministic gradient-descent analysis presented thus far. It has the advantage of naturally arriving at a learning-rate schedule \rho_k for asymptotic convergence in the mean square. Here, one starts with the mean-square error (MSE) criterion function

J(w) = \frac{1}{2} \left\langle \left( d - x^T w \right)^2 \right\rangle    (2.43)

where \langle \cdot \rangle again denotes the mean (expectation) over all training vectors. Now one may compute the gradient of J as

\nabla J(w) = -\left\langle \left( d - x^T w \right) x \right\rangle    (2.44)

which, upon setting to zero, allows us to find the minimum w^* of J in Equation (2.43) as the solution of \langle x x^T \rangle w^* = \langle x d \rangle, which gives

w^* = C^{-1} P    (2.45)

where C = \langle x x^T \rangle and P = \langle x d \rangle. Note that the expected value of a vector or a matrix is found by taking the expected values of its components. We refer to C as the autocorrelation matrix of the input vectors and to P as the cross-correlation vector between the input vector x and its associated desired target d. In Equation (2.45), the determinant of C, |C|, is assumed different from zero. The solution w^* in Equation (2.45) is sometimes called the Wiener weight vector (Widrow and Stearns, 1985). It represents the minimum MSE solution, also known as the least-mean-square (LMS) solution.

It is interesting to note here the close relation between the minimum SSE solution in Equation (2.42) and the LMS or minimum MSE solution in Equation (2.45). In fact, one can show that when the size of the training set m is large, the minimum SSE solution converges to the minimum MSE solution.

First, let us express X X^T as the sum of vector outer products \sum_{k=1}^{m} x^k (x^k)^T. We can also rewrite X d as \sum_{k=1}^{m} d^k x^k. This representation allows us to express Equation (2.42) as

w^* = \left( \sum_{k=1}^{m} x^k (x^k)^T \right)^{-1} \sum_{k=1}^{m} d^k x^k

Now, multiplying the right-hand side of the preceding equation by m/m allows us to express it as

w^* = \left( \frac{1}{m} \sum_{k=1}^{m} x^k (x^k)^T \right)^{-1} \frac{1}{m} \sum_{k=1}^{m} d^k x^k

Finally, if m is large, the averages

\frac{1}{m} \sum_{k=1}^{m} x^k (x^k)^T \quad \text{and} \quad \frac{1}{m} \sum_{k=1}^{m} d^k x^k

become very good approximations of the expectations C = \langle x x^T \rangle and P = \langle x d \rangle, respectively. Thus we have established the equivalence of the minimum SSE and minimum MSE solutions for a large training set.
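This equivalence is easy to check numerically: for a reasonably large sample, the weight vector obtained from the sample-average estimates of C and P agrees with the minimum-SSE solution. A small illustrative sketch (the synthetic data and all names are assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    m = 10000
    X = np.hstack([rng.normal(size=(m, 2)), np.ones((m, 1))])      # rows are augmented inputs x^k
    d = X @ np.array([1.5, -0.5, 0.2]) + 0.1 * rng.normal(size=m)  # noisy linear targets

    # Sample-average estimates of C = <x x^T> and P = <x d>.
    C = X.T @ X / m
    P = X.T @ d / m
    w_mse = np.linalg.solve(C, P)                    # Wiener solution, Equation (2.45)

    w_sse, *_ = np.linalg.lstsq(X, d, rcond=None)    # minimum-SSE solution, Equation (2.42)
    print(np.max(np.abs(w_mse - w_sse)))             # essentially zero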

Next, in order to minimize the MSE criterion, one may employ a gradient-descent procedure where, instead of the expected gradient in Equation (2.44), the instantaneous gradient -\left( d^k - (x^k)^T w^k \right) x^k is used. Here, at each learning step the input vector x^k is drawn at random. This leads to the stochastic process

w^{k+1} = w^k + \rho_k \left( d^k - (x^k)^T w^k \right) x^k    (2.46)

which is the same as the μ-LMS rule in Equation (2.35) except for a variable learning rate \rho_k. It can be shown that if |C| \neq 0 and \rho_k satisfies the three conditions

1. \rho_k \geq 0    (2.47a)

2. \lim_{m \to \infty} \sum_{k=1}^{m} \rho_k = +\infty    (2.47b)

3. \lim_{m \to \infty} \sum_{k=1}^{m} (\rho_k)^2 < \infty    (2.47c)

then w^k converges to w^* in Equation (2.45) asymptotically in the mean square; i.e.,

\lim_{k \to \infty} \left\langle \| w^k - w^* \|^2 \right\rangle = 0    (2.48)

The criterion function in Equation (2.43) is of the form \langle g(w, x) \rangle and is known as a regression function. The iterative algorithm in Equation (2.46) is also known as a stochastic approximation procedure (or Kiefer-Wolfowitz or Robbins-Monro procedure). For a thorough discussion of stochastic approximation theory, the reader is referred to Wasan (1969).
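A decreasing schedule such as \rho_k proportional to 1/k satisfies conditions (2.47a)-(2.47c). The sketch below runs the stochastic iteration of Equation (2.46) with such a schedule on illustrative synthetic data (all names and constants are assumptions):

    import numpy as np

    rng = np.random.default_rng(2)
    m = 5000
    X = np.hstack([rng.normal(size=(m, 2)), np.ones((m, 1))])
    d = X @ np.array([0.7, -1.2, 0.3]) + 0.05 * rng.normal(size=m)

    w = np.zeros(3)
    for k in range(1, 50001):
        i = rng.integers(m)                       # draw a training pair at random
        rho_k = 1.0 / (10.0 + k)                  # decaying rate; satisfies (2.47a)-(2.47c)
        w += rho_k * (d[i] - X[i] @ w) * X[i]     # Equation (2.46)

    # w approaches the Wiener solution C^{-1} P as the number of steps grows.
    print(w)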

Curs 08 - Correlation Learning Rule / The Delta Rule

Correlation Learning Rule

The correlation learning rule is derived by starting from the criterion function

J(w) = -\sum_{i=1}^{m} d^i y^i    (2.49)

where y^i = (x^i)^T w, and performing gradient descent to minimize J. Note that minimizing J(w) is equivalent to maximizing the correlation between the desired target d^i and the corresponding linear unit's output y^i for all x^i, i = 1, 2, ..., m. Now, employing steepest gradient descent to minimize J(w) leads to the learning rule

w^{k+1} = w^k + \rho \, d^k x^k, \qquad w^1 = 0    (2.50)

By setting \rho to 1 and completing one learning cycle using Equation (2.50), we arrive at the weight vector w^* given by

w^* = \sum_{i=1}^{m} d^i x^i = X d    (2.51)

where X and d are as defined above. Note that Equation (2.51) leads to the minimum SSE solution in Equation (2.38) if X^{\dagger} = X. This is only possible if the training vectors x^k are encoded such that X X^T is the identity matrix (i.e., the vectors x^k are orthonormal).
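As a quick illustration (the names below are assumptions, not from the text), one learning cycle of Equation (2.50) with \rho = 1 simply accumulates d^i x^i, and the claim above can be checked against a least-squares solver when the training vectors are orthonormal:

    import numpy as np

    def correlation_rule(X, d):
        # One cycle of Equation (2.50) with rho = 1: w* = sum_i d^i x^i = X d.
        w = np.zeros(X.shape[0])
        for x_i, d_i in zip(X.T, d):
            w += d_i * x_i
        return w                                      # identical to X @ d

    X = np.eye(3)                                     # orthonormal training vectors: X X^T = I
    d = np.array([1.0, -1.0, 1.0])
    w_corr = correlation_rule(X, d)
    w_sse, *_ = np.linalg.lstsq(X.T, d, rcond=None)   # minimum-SSE solution
    print(np.allclose(w_corr, w_sse))                 # True when X X^T = I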

Another version of this type of learning is the covariance learning rule. This rule is obtained by steepest gradient descent on the criterion function

J(w) = -\sum_{i=1}^{m} \left( y^i - \bar{y} \right) \left( d^i - \bar{d} \right)

Here, \bar{y} and \bar{d} are computed averages, over all training pairs, of the unit's output and the desired target, respectively. Covariance learning provides the basis of the cascade-correlation net.

The Delta Rule

The following rule is similar to the μ-LMS rule except that it allows for units with a differentiable nonlinear activation function f. Figure 2-7 illustrates a unit with a sigmoidal activation function. Here, the unit's output is y = f(net), with net defined as the vector inner product net = x^T w.

Figure 2-7 A perceptron with a differentiable sigmoidal activation function.

Again, consider the training pairs \{ x^i, d^i \}, i = 1, 2, ..., m, with x^i \in R^{n+1} (x^i_{n+1} = 1 for all i) and d^i \in [-1, +1]. Performing gradient descent on the instantaneous SSE criterion function

J(w) = \frac{1}{2} \left( d - y \right)^2

whose gradient is given by

\nabla J(w) = -\left( d - y \right) f'(net) \, x    (2.52)

leads to the delta rule

w^{k+1} = w^k + \rho \left( d^k - y^k \right) f'(net^k) \, x^k, \qquad w^1 \ \text{arbitrary}    (2.53)

where net^k = (x^k)^T w^k and f' = df / d\,net. If f is defined by f(net) = \tanh(\beta \, net), then its derivative is given by f'(net) = \beta \left( 1 - f(net)^2 \right). For the "logistic" function, f(net) = 1 / \left( 1 + e^{-\beta \, net} \right), the derivative is f'(net) = \beta \, f(net) \left( 1 - f(net) \right). Figure 2-8 plots f and f' for the hyperbolic tangent activation function with \beta = 1.
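The quoted derivatives follow from a one-line computation; in LaTeX form (with \beta the slope parameter used in the text):

    f(net) = \tanh(\beta\,net) \;\Rightarrow\;
    f'(net) = \beta\,\mathrm{sech}^2(\beta\,net)
            = \beta\bigl(1 - \tanh^2(\beta\,net)\bigr)
            = \beta\bigl(1 - f(net)^2\bigr)

    f(net) = \frac{1}{1 + e^{-\beta\,net}} \;\Rightarrow\;
    f'(net) = \frac{\beta\,e^{-\beta\,net}}{\bigl(1 + e^{-\beta\,net}\bigr)^{2}}
            = \beta\,f(net)\bigl(1 - f(net)\bigr)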

Figure 2-8 Hyperbolic tangent activation function f and its derivative f', plotted for -3 \leq net \leq +3.

One disadvantage of the delta learning rule is immediately apparent upon inspection of the graph of f'(net) in Figure 2-8. In particular, notice how f'(net) \approx 0 when net has large magnitude (i.e., |net| > 3); these regions are called flat spots of f'. In these flat spots, we expect the delta learning rule to progress very slowly (i.e., very small weight changes even when the error (d - y) is large), because the magnitude of the weight change in Equation (2.53) directly depends on the magnitude of f'(net). Since slow convergence results in excessive computation time, it would be advantageous to try to eliminate the flat spot phenomenon when using the delta learning rule. One common flat spot elimination technique involves replacing f' by f' plus a small positive bias \varepsilon. In this case, the weight update equation reads

w^{k+1} = w^k + \rho \left( d^k - y^k \right) \left[ f'(net^k) + \varepsilon \right] x^k    (2.54)
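A minimal sketch of the incremental delta rule with the optional flat-spot bias of Equation (2.54); the data handling, learning rate, and value of the bias are illustrative assumptions (setting it to 0 recovers Equation (2.53)):

    import numpy as np

    def train_delta_rule(X, d, rho=0.1, eps=0.05, epochs=100, seed=0):
        # Incremental delta rule for a single tanh unit (beta = 1).
        # X: m x (n+1) array of augmented input vectors; d: targets in [-1, +1].
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(d)):
                net = X[i] @ w
                y = np.tanh(net)
                fprime = 1.0 - y ** 2                            # f'(net) for tanh
                w += rho * (d[i] - y) * (fprime + eps) * X[i]    # Equation (2.54)
        return w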

One of the primary advantages of the delta rule is that it has a natural extension that may be used to train multilayered neural nets. This extension, known as error backpropagation, will be discussed in Chapter 3.

Adaptive Ho-Kashyap (AHK) Learning Rules

Hassoun and Song (1992) proposed a set of adaptive learning rules for classification problems as enhanced alternatives to the LMS and perceptron learning rules. In the following, three learning rules, AHK I, AHK II, and AHK III, are derived based on gradient-descent strategies on an appropriate criterion function. Two of the proposed learning rules, AHK I and AHK II, are well suited for generating robust decision surfaces for linearly separable problems. The third training rule, AHK III, extends these capabilities to find "good" approximate solutions for nonlinearly separable problems. The three AHK learning rules preserve the simple incremental nature found in the LMS and perceptron learning rules. The AHK rules also possess additional processing capabilities, such as the ability to automatically identify critical cluster boundaries and place a linear decision surface in such a way that it leads to enhanced classification robustness.

Consider a two-class \{c_1, c_2\} classification problem with m labeled feature vectors (training vectors) \{ x^i, d^i \}, i = 1, 2, ..., m. Assume that x^i belongs to R^{n+1} (with the last component of x^i being a constant bias of value 1) and that d^i = +1 (-1) if x^i \in c_1 (c_2). Then, a single perceptron can be trained to correctly classify the preceding training pairs if an (n+1)-dimensional weight vector w is computed that satisfies the following set of m inequalities (the sgn function is assumed to be the perceptron's activation function):

(x^i)^T w > 0 \ \text{if} \ d^i = +1, \qquad (x^i)^T w < 0 \ \text{if} \ d^i = -1, \qquad \text{for} \ i = 1, 2, ..., m    (2.55)

Next, if we define a set of m new vectors z^i according to

z^i = \begin{cases} \ \ x^i & \text{if} \ d^i = +1 \\ -x^i & \text{if} \ d^i = -1 \end{cases} \qquad \text{for} \ i = 1, 2, ..., m    (2.56)

and we let

Z = \left[ z^1 \ z^2 \ \ldots \ z^m \right]    (2.57)

then Equation (2.55) may be rewritten as the single matrix inequality

Z^T w > 0    (2.58)

Now, defining an m-dimensional positive-valued margin vector b (b > 0) and using it in Equation (2.58), we arrive at the following equivalent form of Equation (2.55):

Z^T w = b    (2.59)

Thus the training of the perceptron is now equivalent to solving Equation (2.59) for w, subject to the constraint b > 0. Ho and Kashyap (1965) proposed an iterative algorithm for solving Equation (2.59). In the Ho-Kashyap algorithm, the components of the margin vector are first initialized to small positive values, and the pseudoinverse is used to generate a solution for w (based on the initial guess of b) that minimizes the SSE criterion function J(w, b) = \frac{1}{2} \| Z^T w - b \|^2:

w = Z^{\dagger} b    (2.60)

where Z^{\dagger} = (Z Z^T)^{-1} Z, for m > n + 1. Next, a new estimate for the margin vector is computed by performing the constrained (b > 0) gradient descent

b^{k+1} = b^k + \frac{\rho}{2} \left( \varepsilon^k + | \varepsilon^k | \right) \qquad \text{with} \qquad \varepsilon^k = Z^T w^k - b^k    (2.61)

where | \cdot | denotes the componentwise absolute value of the argument vector, and b^k is the "current" margin vector. A new estimate of w can now be computed using Equation (2.60) and employing the updated margin vector from Equation (2.61). This process continues until all the components of \varepsilon are zero (or are sufficiently small and positive), which is an indication of linear separability of the training set, or until \varepsilon \leq 0, which is an indication of nonlinear separability of the training set (no solution is found). It can be shown (Ho and Kashyap, 1965) that the Ho-Kashyap procedure converges in a finite number of steps if the training set is linearly separable. For simulations comparing the preceding training algorithm with the LMS and perceptron training procedures, the reader is referred to Hassoun and Clark (1988). This algorithm will be referred to here as the direct Ho-Kashyap (DHK) algorithm.
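A compact sketch of the DHK iteration just described; the initial margin, learning rate, and stopping tolerances are illustrative choices:

    import numpy as np

    def direct_ho_kashyap(Z, rho=0.9, b0=0.1, max_iter=1000, tol=1e-6):
        # Direct Ho-Kashyap algorithm for Z^T w = b, b > 0 (Equations (2.60)-(2.61)).
        # Z: (n+1) x m matrix whose columns are the vectors z^i of Equation (2.56).
        m = Z.shape[1]
        b = np.full(m, b0)
        Z_dagger = np.linalg.pinv(Z.T)            # pseudoinverse of Z^T, computed once
        w = Z_dagger @ b
        for _ in range(max_iter):
            w = Z_dagger @ b                      # Equation (2.60)
            err = Z.T @ w - b                     # error vector epsilon^k
            if np.all(np.abs(err) < tol):
                return w, b, True                 # all errors ~ 0: linearly separable
            if np.all(err <= 0.0):
                return w, b, False                # epsilon <= 0: nonlinearly separable
            b = b + 0.5 * rho * (err + np.abs(err))   # Equation (2.61): b never decreases
        return w, b, False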

The direct synthesis of the w estimate in Equation (2.60) involves a one-time computation of the pseudoinverse of Z^T. However, such computation can be computationally expensive and requires special treatment when Z Z^T is ill-conditioned (i.e., when the determinant |Z Z^T| is close to zero). An alternative algorithm that is based on gradient-descent principles and which does not require the direct computation of Z^{\dagger} can be derived. This derivation is presented next.

Starting with the criterion function J(w, b) = \frac{1}{2} \| Z^T w - b \|^2, gradient descent may be performed with respect to b and w so that J is minimized subject to the constraint b > 0. The gradients of J with respect to b and w are given by

\nabla_b J(w, b) \big|_{w^k, b^k} = -\left( Z^T w^k - b^k \right)    (2.62a)

\nabla_w J(w, b) \big|_{w^k, b^{k+1}} = Z \left( Z^T w^k - b^{k+1} \right)    (2.62b)

where the superscripts k and k+1 represent current and updated values, respectively. One analytic method for imposing the constraint b > 0 is to replace the gradient in Equation (2.62a) by -0.5 \left( \varepsilon^k + | \varepsilon^k | \right), with \varepsilon^k as defined in Equation (2.61). This leads to the following gradient-descent formulation of the Ho-Kashyap procedure:

b^{k+1} = b^k + \frac{\rho_1}{2} \left( \varepsilon^k + | \varepsilon^k | \right) \qquad \text{with} \qquad \varepsilon^k = Z^T w^k - b^k    (2.63a)

and

w^{k+1} = w^k - \rho_2 Z \left( Z^T w^k - b^{k+1} \right) = w^k - \rho_2 Z \left[ \varepsilon^k - \frac{\rho_1}{2} \left( \varepsilon^k + | \varepsilon^k | \right) \right]    (2.63b)

where \rho_1 and \rho_2 are strictly positive constant learning rates. Because of the requirement that all training vectors z^k (or x^k) be present and included in Z, this procedure is called the batch-mode adaptive Ho-Kashyap (AHK) procedure. It can easily be shown that if \rho_1 = 0 and b^1 = 1, Equation (2.63) reduces to the μ-LMS learning rule. Furthermore, convergence can be guaranteed (Duda and Hart, 1973) if 0 < \rho_1 < 2 and 0 < \rho_2 < 2 / \lambda_{max}, where \lambda_{max} is the largest eigenvalue of the positive-definite matrix Z Z^T.

A completely adaptive Ho-Kashyap procedure for solving Equation (2.59) is arrived at by starting from the instantaneous criterion function

J(w, b) = \frac{1}{2} \left[ (z^i)^T w - b_i \right]^2

which leads to the following incremental update rules:

b_i^{k+1} = b_i^k + \frac{\rho_1}{2} \left( \varepsilon_i^k + | \varepsilon_i^k | \right) \qquad \text{with} \qquad \varepsilon_i^k = (z^i)^T w^k - b_i^k    (2.64a)

and

w^{k+1} = w^k - \rho_2 z^i \left[ (z^i)^T w^k - b_i^{k+1} \right] = w^k - \rho_2 \left[ \varepsilon_i^k - \frac{\rho_1}{2} \left( \varepsilon_i^k + | \varepsilon_i^k | \right) \right] z^i    (2.64b)

Here, b_i represents a scalar margin associated with the input x^i. In all the preceding Ho-Kashyap learning procedures, the margin values are initialized to small positive values, and the perceptron

weights are initialized to zero (or small random) values. If full margin error correction is assumed in Equation (2.64a), i.e., \rho_1 = 1, the incremental learning procedure in Equation (2.64) reduces to the heuristically derived procedure reported in Hassoun and Clark (1988). An alternative way of writing Equation (2.64) is

\Delta b_i = \rho_1 \varepsilon_i^k \quad \text{and} \quad \Delta w = \rho_2 (1 - \rho_1) \varepsilon_i^k z^i \qquad \text{if} \ \varepsilon_i^k > 0    (2.65a)

\Delta b_i = 0 \quad \text{and} \quad \Delta w = \rho_2 \varepsilon_i^k z^i \qquad \text{if} \ \varepsilon_i^k \leq 0    (2.65b)

where \Delta b and \Delta w signify the difference between the updated and current values of b and w, respectively. This procedure is called the AHK I learning rule. For comparison purposes, it may be noted that the μ-LMS rule in Equation (2.35) can be written as \Delta w = \mu \, \varepsilon_i^k z^i, with b_i held fixed at +1.
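A minimal incremental implementation of the AHK I updates in Equation (2.65); the learning rates, initial margins, and epoch loop are illustrative assumptions:

    import numpy as np

    def ahk1(Z, rho1=0.5, rho2=0.1, b0=0.1, epochs=200):
        # AHK I learning rule, Equation (2.65).
        # Z: (n+1) x m matrix of pattern vectors z^i (inputs multiplied by their +/-1 labels).
        n_plus_1, m = Z.shape
        w = np.zeros(n_plus_1)
        b = np.full(m, b0)                     # positive scalar margins b_i
        for _ in range(epochs):
            for i in range(m):
                err = Z[:, i] @ w - b[i]       # epsilon_i^k
                if err > 0:
                    b[i] += rho1 * err                         # (2.65a): margin grows
                    w += rho2 * (1.0 - rho1) * err * Z[:, i]
                else:
                    w += rho2 * err * Z[:, i]                  # (2.65b)
        return w, b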

The implied constraint b_i > 0 in Equations (2.64) and (2.65) was realized by starting with a positive initial margin and restricting the change \Delta b to positive real values. An alternative, more flexible way to realize this constraint is to allow both positive and negative changes in \Delta b, except for the cases where a decrease in b_i would result in a negative margin. This modification results in the following alternative AHK II learning rule:

\Delta b_i = \rho_1 \varepsilon_i^k \quad \text{and} \quad \Delta w = \rho_2 (1 - \rho_1) \varepsilon_i^k z^i \qquad \text{if} \ b_i^k + \rho_1 \varepsilon_i^k > 0    (2.66a)

\Delta b_i = 0 \quad \text{and} \quad \Delta w = \rho_2 \varepsilon_i^k z^i \qquad \text{if} \ b_i^k + \rho_1 \varepsilon_i^k \leq 0    (2.66b)

In the general case of an adaptive margin, as in Equation (2.66), Hassoun and Song (1992) showed that a sufficient condition for the convergence of the AHK rules is given by

0 < \rho_2 < \frac{2}{\max_i \| z^i \|^2}

A criterion function of the form J(w) = \sum_i g(s^i), with s^i = (z^i)^T w, is said to be well formed (Wittner and Denker, 1988) when g is differentiable and satisfies conditions that include the following:

2. dg(s)/ds is negative and bounded away from zero for all s \leq 0; i.e., g keeps pushing if there is a misclassification.

3. g(s) is bounded from below.

For a single unit with weight vector w, it can be shown (Wittner and Denker, 1988) that if the criterion function is well formed, then gradient descent is guaranteed to enter the region w^* of linearly separable solutions, provided that such a region exists.

Table 2-1 Summary of Basic Learning Rules

Notation used throughout the table: the general form of the learning equation is w^{k+1} = w^k + \rho_k s^k, where \rho_k is the learning rate and s^k is the learning vector; net = x^T w; and z^k = x^k if d^k = +1, z^k = -x^k if d^k = -1.

Perceptron rule (supervised)
Criterion function: J(w) = \sum_{(z^i)^T w \leq 0} \left( -(z^i)^T w \right)
Learning vector: z^k if (z^k)^T w^k \leq 0; 0 otherwise
Conditions: \rho > 0
Activation function: f(net) = \mathrm{sgn}(net)
Remarks: Finite convergence time if the training set is linearly separable; w stays bounded for arbitrary training sets.

Perceptron rule with variable learning rate and fixed margin (supervised)
Criterion function: J(w) = \sum_{(z^i)^T w \leq b} \left( b - (z^i)^T w \right)
Learning vector: z^k if (z^k)^T w^k \leq b; 0 otherwise
Conditions: b > 0, and \rho_k satisfies 1. \rho_k \geq 0; 2. \sum_{k=1}^{\infty} \rho_k = \infty; 3. \lim_{m \to \infty} \left[ \sum_{k=1}^{m} \rho_k^2 \big/ \left( \sum_{k=1}^{m} \rho_k \right)^2 \right] = 0
Activation function: f(net) = \mathrm{sgn}(net)
Remarks: Converges to a solution satisfying z^T w > b if the training set is linearly separable; finite convergence if \rho_k = \rho, where \rho is a finite positive constant.

May's rule (supervised)
Criterion function: J(w) = \frac{1}{2} \sum_{(z^i)^T w \leq b} \left( b - (z^i)^T w \right)^2 / \| z^i \|^2
Learning vector: \left[ \left( b - (z^k)^T w^k \right) / \| z^k \|^2 \right] z^k if (z^k)^T w^k \leq b; 0 otherwise
Conditions: 0 < \rho < 2, b > 0
Activation function: f(net) = \mathrm{sgn}(net)
Remarks: Finite convergence to a solution satisfying z^T w > b > 0 if the training set is linearly separable.

Butz's rule (supervised)
Criterion function: J(w) = -\sum_i (z^i)^T w
Learning vector: z^k if (z^k)^T w^k \leq 0; \gamma z^k otherwise
Conditions: 0 \leq \gamma < 1, \rho > 0
Activation function: f(net) = \mathrm{sgn}(net)
Remarks: Finite convergence if the training set is linearly separable; places w in a region that tends to minimize the probability of error for nonlinearly separable cases.

Widrow-Hoff rule (α-LMS) (supervised)
Criterion function: J(w) = \frac{1}{2} \sum_i \left( d^i - (x^i)^T w \right)^2 / \| x^i \|^2
Learning vector: \left[ \left( d^k - (x^k)^T w^k \right) / \| x^k \|^2 \right] x^k
Conditions: 0 < \alpha < 2
Activation function: f(net) = net
Remarks: Converges in the mean square to the minimum SSE or LMS solution if \| x^i \| = \| x^j \| for all i, j.

μ-LMS rule (supervised)
Criterion function: J(w) = \frac{1}{2} \sum_i \left( d^i - (x^i)^T w \right)^2
Learning vector: \left( d^k - (x^k)^T w^k \right) x^k
Conditions: 0 < \mu < 2 / (3 \, \mathrm{tr}\, C)
Activation function: f(net) = net
Remarks: Converges to the minimum SSE solution if the vectors x^k are mutually orthonormal.

Delta rule (supervised)
Criterion function: J(w) = \frac{1}{2} \sum_i \left( d^i - y^i \right)^2, with y^i = f\left( (x^i)^T w \right)
Learning vector: \left( d^k - y^k \right) f'(net^k) \, x^k
Conditions: 0 < \rho < 1
Activation function: y = f(net), where f is a sigmoid function
Remarks: Extends the μ-LMS rule to cases with differentiable nonlinear activations.

Minkowski-r delta rule (supervised)
Criterion function: J(w) = \frac{1}{r} \sum_i \left| d^i - y^i \right|^r
Learning vector: \left| d^k - y^k \right|^{r-1} \mathrm{sgn}\left( d^k - y^k \right) f'(net^k) \, x^k
Conditions: 0 < \rho < 1
Activation function: y = f(net), where f is a sigmoid function
Remarks: 0 < r < 2 is appropriate for pseudo-Gaussian distributions p(x) with pronounced tails; r = 2 gives the delta rule; r = 1 arises when p(x) is a Laplace distribution.

Relative entropy delta rule (supervised)
Criterion function: J(w) = \frac{1}{2} \sum_i \left[ (1 + d^i) \ln \frac{1 + d^i}{1 + y^i} + (1 - d^i) \ln \frac{1 - d^i}{1 - y^i} \right]
Learning vector: \left( d^k - y^k \right) x^k
Conditions: 0 < \rho < 1
Activation function: y = \tanh(net)
Remarks: Eliminates the flat spot suffered by the delta rule; converges to a linearly separable solution if one exists.

AHK I rule (supervised)
Criterion function: J(w, b) = \frac{1}{2} \sum_i \left( (z^i)^T w - b_i \right)^2, with margins b_i > 0 and \varepsilon_i^k = (z^i)^T w^k - b_i^k
Margin update: \Delta b_i = \rho_1 \varepsilon_i^k if \varepsilon_i^k > 0; \Delta b_i = 0 otherwise
Weight vector update: \Delta w = \rho_2 (1 - \rho_1) \varepsilon_i^k z^i if \varepsilon_i^k > 0; \Delta w = \rho_2 \varepsilon_i^k z^i if \varepsilon_i^k \leq 0
Conditions: b^1 > 0, 0 < \rho_1 < 2, 0 < \rho_2 < 2 / \max_i \| z^i \|^2, b_i^k \geq 0
Activation function: f(net) = \mathrm{sgn}(net)
Remarks: The b_i values can take any positive value; converges to a robust solution for linearly separable problems.

AHK II rule (supervised)
Criterion function: J(w, b) = \frac{1}{2} \sum_i \left( (z^i)^T w - b_i \right)^2, with \varepsilon_i^k = (z^i)^T w^k - b_i^k
Margin update: \Delta b_i = \rho_1 \varepsilon_i^k if b_i^k + \rho_1 \varepsilon_i^k > 0; \Delta b_i = 0 otherwise
Weight vector update: \Delta w = \rho_2 (1 - \rho_1) \varepsilon_i^k z^i if b_i^k + \rho_1 \varepsilon_i^k > 0; \Delta w = \rho_2 \varepsilon_i^k z^i if b_i^k + \rho_1 \varepsilon_i^k \leq 0
Conditions: b^1 > 0, 0 < \rho_1 < 2, 0 < \rho_2 < 2 / \max_i \| z^i \|^2
Activation function: f(net) = \mathrm{sgn}(net)
Remarks: Allows the margins b_i to increase or decrease during training; converges to a robust solution for linearly separable problems.

AHK III rule (supervised)
Criterion function: J(w, b) = \frac{1}{2} \sum_i \left( (z^i)^T w - b_i \right)^2, with \varepsilon_i^k = (z^i)^T w^k - b_i^k
Margin update: \Delta b_i = \rho_1 \varepsilon_i^k if b_i^k + \rho_1 \varepsilon_i^k > 0; \Delta b_i = 0 otherwise
Weight vector update: \Delta w = \rho_2 (1 - \rho_1) \varepsilon_i^k z^i if b_i^k + \rho_1 \varepsilon_i^k > 0; \Delta w = 0 otherwise
Conditions: b^1 > 0, 0 < \rho_1 < 2, 0 < \rho_2 < 2 / \max_i \| z^i \|^2
Activation function: f(net) = \mathrm{sgn}(net)
Remarks: Extends AHK I and AHK II to nonlinearly separable problems by limiting the influence of misclassifications.

Delta rule for stochastic units (supervised)
Criterion function: J(w) = \frac{1}{2} \sum_i \left( d^i - y^i \right)^2
Learning vector: \left[ d^k - \tanh(net^k) \right] \left[ 1 - \tanh^2(net^k) \right] x^k
Conditions: 0 < \rho < 1
Activation function: stochastic, with y = +1 with probability P(y = +1) and y = -1 with probability 1 - P(y = +1), where P(y = +1) = 1 / \left( 1 + e^{-2\,net} \right)
Remarks: Performance on average is equivalent to the delta rule applied to a unit with the deterministic activation y = \tanh(net).

Curs 10 - Perceptron Networks

Learning Rules for Multilayer Feedforward Neural Networks

This chapter extends the gradient-descent-based delta rule of Chapter 2 to multilayer feedforward neural networks. The resulting learning rule is commonly known as error backpropagation (or backprop), and it is one of the most frequently used learning rules in many applications of artificial neural networks.

The backprop learning rule is central to much current work on learning in artificial neural networks. In fact, the development of backprop is one of the main reasons for the renewed interest in artificial neural networks. Backprop provides a computationally efficient method for changing the weights in a feedforward network, with differentiable activation function units, to learn a training set of input-output examples. Backprop-trained multilayer neural nets have been applied successfully to solve some difficult and diverse problems, such as pattern classification, function approximation, nonlinear system modeling, time-series prediction, and image compression and reconstruction. For these reasons, most of this chapter is devoted to the study of backprop, its variations, and its extensions.

Backpropagation is a gradient-descent search algorithm that may suffer from slow convergence to local minima. In this chapter, several methods for improving backprop's convergence speed and avoidance of local minima are presented. Whenever possible, theoretical justification is given for these methods. A version of backprop based on an enhanced criterion function with global search capability is described which, when properly tuned, allows for relatively fast convergence to good solutions.

Consider the two-layer feedforward architecture shown in Figure 0-1. This network receives a set of scalar signals \{ x_0, x_1, ..., x_n \}, where x_0 is a bias signal equal to 1. This set of signals constitutes an input vector x \in R^{n+1}. The layer receiving the input signal is called the hidden layer. Figure 0-1 shows a hidden layer having J units. The output of the hidden layer is a (J + 1)-dimensional real-valued vector z = [z_0, z_1, ..., z_J]^T. Again, z_0 = 1 represents a bias input and can be thought of as being generated by a "dummy" unit (with index zero) whose output z_0 is clamped at 1. The vector z supplies the input for the output layer of L units. The output layer generates an L-dimensional vector y in response to the input x which, when the network is fully trained, should be identical (or very close) to a "desired" output vector d associated with x.
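In code, the forward pass through this architecture amounts to two matrix-vector products with a bias element prepended at each layer. The weight shapes and the choice of tanh activations below are illustrative assumptions; the activation functions actually used are specified later in the chapter:

    import numpy as np

    def forward(x, W_hidden, W_output):
        # Forward pass of the two-layer feedforward net of Figure 0-1.
        # x:        length-n input (the bias x_0 = 1 is prepended here)
        # W_hidden: J x (n+1) hidden-layer weight matrix
        # W_output: L x (J+1) output-layer weight matrix
        x_aug = np.concatenate(([1.0], x))      # x_0 = 1 bias signal
        z = np.tanh(W_hidden @ x_aug)           # hidden-layer outputs z_1 ... z_J
        z_aug = np.concatenate(([1.0], z))      # z_0 = 1 "dummy" bias unit
        y = np.tanh(W_output @ z_aug)           # L-dimensional network output
        return y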

    Figure 0-1 A two-