
  • Neural computing

    ICO4168

    Wassner Hubert

    [email protected]

    http://professeurs.esiea.fr/wassner/

    « Hans the clever »: an old story to explain how difficult and surprising it can be to teach a trick to something.

    Table of contents

    Course presentation .... 4
    Prerequisites .... 4
    Summary .... 4
    Presented techniques .... 4
    Objectives .... 4
    Introduction .... 5
    Foreword on the « fractal approach » .... 5
    Bionics (From Wikipedia, the free encyclopedia) .... 6
    Neuron (From Wikipedia, the free encyclopedia) .... 7
    Contents .... 8
    History .... 8
    Anatomy and histology .... 8
    Classes .... 10


    Connectivity .... 10
    Adaptations to carrying action potentials .... 11
    Challenges to the neuron doctrine .... 12
    Neurons in the brain .... 12
    See also .... 13
    Sources .... 13
    External links .... 13
    Artificial neuron (From Wikipedia, the free encyclopedia) .... 14
    Contents .... 14
    Basic structure .... 14
    History .... 15
    Types of transfer functions .... 15
    Bibliography .... 16
    Mathematical model and properties of Artificial Neural Networks .... 17
    Mathematical model .... 17
    Properties .... 20
    Training and using (simplified) .... 24
    Training a NN .... 24
    Using a NN .... 25
    Applications .... 25
    Deeper insight .... 26
    Foreword on classification and prediction systems .... 26
    Threshold finding .... 27
    Receiver operating characteristic (From Wikipedia, the free encyclopedia) .... 29
    Testing precautions .... 31
    The art of splitting datasets .... 31
    Datasets cost .... 31
    Are we doing that good? .... 32
    Training, testing and using (the real stuff) .... 33
    Preparing the datasets .... 33
    Supervised learning .... 34
    Gradient descent (From Wikipedia, the free encyclopedia) .... 35
    Description of the method .... 35
    Comments .... 37
    See also .... 37
    Backpropagation (From Wikipedia, the free encyclopedia) .... 38
    When the problem is not that simple... « Neural cooking » .... 39
    Local optimum .... 39
    Over-fitting/Under-fitting .... 40
    Contents .... 41
    Clever Hans and Pfungst's study .... 41
    Clever Hans effect .... 42
    Reference .... 42
    Unsupervised learning .... 45
    K-means algorithm (From Wikipedia, the free encyclopedia) .... 46
    References .... 46
    Kohonen network .... 47
    The wide field of neural networks .... 54


    A word on algorithmic complexity .... 55
    Training is CPU consuming, but... .... 55
    Solutions .... 55
    NN are not magic! .... 56
    How to actually create and use neural networks .... 57
    When NN are doing better than experts .... 58
    Deeper, deeper inside: Exercises .... 58
    Datasets .... 58
    Bibliography .... 59
    Bibliography .... 59
    Web sites .... 60
    Software .... 60
    Open source .... 60
    Proprietary .... 60


  • Course presentation

    Prerequisites

    Good algorithmic knowledge

    Good math basics (continuity, differentiation and optimisation)

    Good (C) programming skills

    Summary

    Neural computing techniques are said to be bio-inspired/bio-mimetic, meaning that they are inspired by known biological phenomena. The goal is to borrow the learning and self-organisation capabilities of living organisms and to exploit them on classic algorithmic problems.

    Applications include: artificial intelligence, classification, identification, biometrics, prediction, data mining, ...

    The course is twofold:

    a light theoretical part

    a practical part using open-source libraries to solve real-life problems

    Presented techniques

    Classification basics

    Supervised learning

    feed forward multi-layer neural network

    Unsupervised learning

    Kohonen network (Self Organizing Maps)

    Objectives

    Students following this course will not become experts in the neural computing field. The goal of this course is to understand the basics of neural computing theory, to be able to identify problems that can be solved by these techniques, and to be able to implement such solutions with software libraries.


  • Introduction

    Foreword on the « fractal approach »

    This field of computer science can be quite hard to understand because many prerequisites are needed (such as mathematics). This makes it a field of choice for engineers to set themselves apart from technicians. To make this course easier I will use what I call the « fractal approach »: the idea is to always know the goal of every part of the course. This leads to a certain amount of redundancy and repetition, but it avoids the loss of motivation caused by fuzzy goals. A fractal is an object that has more or less the same appearance at whatever scale you look at it. A shoreline is fractal: if you look at it at a large scale and then zoom in, you get the same kind of visual impression. This is why it is hard to know where you are when walking along a shore. The same problem can occur when walking linearly through such a course. To avoid it, I will frequently recall the local and global objectives of each part of the course. This is the equivalent of frequently zooming in and out on the shore map to know precisely where you are on the shore/in the course.

    That is why the introduction and the « deeper insight » parts seem to talk about exactly the same things and to have the same goals: they are simply not at the same scale.


  • Neural computing, a part of biomimetics...

    Bionics (From Wikipedia, the free encyclopedia)

    Bionics (also known as biomimetics, biognosis, biomimicry, or bionical creativity engineering) is the application of methods and systems found in nature to the study and design of engineering systems and modern technology. Also a short form of biomechanics, the word 'bionic' is actually a portmanteau formed from biology (from the Greek word "βίος", pronounced "vios", meaning "life") and electronic.

    The transfer of technology between lifeforms and synthetic constructs is desirable because evolutionary pressure typically forces natural systems to become highly optimized and efficient. A classical example is the development of dirt- and water-repellent paint (coating) from the observation that the surface of the lotus flower plant is practically unsticky for anything (the lotus effect). Examples of bionics in engineering include the hulls of boats imitating the thick skin of dolphins, sonar, radar, and medical ultrasound imaging imitating the echolocation of bats.

    In the field of computer science, the study of bionics has produced cybernetics, artificial neurons, artificial neural networks, and swarm intelligence. Evolutionary computation was also motivated by bionics ideas but it took the idea further by simulating evolution in silico and producing well-optimized solutions that had never appeared in nature.

    Biomimetics is the field of algorithmics inspired by biology. The goal is to simulate known biological phenomena. This enables algorithms to exhibit interesting capabilities such as self-organisation, adaptation and even learning... to some extent.

    Neural techniques mimic the biological neuron.

    We must therefore first look at some basics of neuron biology...


  • Neuron (From Wikipedia, the free encyclopedia)

    Drawing by Santiago Ramón y Cajal of cells in the pigeon cerebellum. (A) Denotes Purkinje cells, an example of a bipolar neuron. (B) Denotes granule cells which are multipolar.

    Neurons are a major class of cells in the nervous system. Neurons are sometimes called nerve cells, though this term is technically imprecise, as many neurons do not form nerves. In vertebrates, neurons are found in the brain, the spinal cord and in the nerves and ganglia of the peripheral nervous system. Their main role is to process and transmit information. Neurons have excitable membranes, which allow them to generate and propagate electrical impulses.


  • Contents

    1 History
    2 Anatomy and histology
    3 Classes
    4 Connectivity
    5 Adaptations to carrying action potentials
    6 Histology and internal structure
    7 Challenges to the neuron doctrine
    8 Neurons in the brain
    9 See also
    10 Sources
    11 External links

    History

    The concept of a neuron as the primary computational unit of the nervous system was devised by the Spanish anatomist Santiago Ramón y Cajal in the early 20th century. Cajal proposed that neurons were discrete cells which communicated with each other via specialized junctions. This became known as the Neuron Doctrine, one of the central tenets of modern neuroscience. However, Cajal would not have been able to observe the structure of individual neurons if his rival, Camillo Golgi (for whom the Golgi Apparatus is named), had not developed his silver staining method. When the Golgi Stain is applied to neurons, it binds the cell's microtubules and gives stained cells a black outline when light is shone through them.

    Anatomy and histology

    Many neurons are highly specialized, and they differ widely in appearance. Neurons have cellular extensions known as processes which they use to send and receive information. Neurons are typically 4 to 100 micrometres in diameter; the size varies depending on the type of neuron and the species it is from. [1]


  • The soma, or 'cell body', is the central part of the cell, where the nucleus is located and where most protein synthesis occurs.

    The dendrite is a branching arbor of cellular extensions. Most neurons have several dendrites with profuse dendritic branches. The overall shape and structure of a neuron's dendrites is called its dendritic tree, and is traditionally thought to be the main information receiving network for the neuron. However, information outflow (i.e. from dendrites to other neurons) can also occur.

    The axon is a finer, cable-like projection which can extend tens, hundreds, or even tens of thousands of times the diameter of the soma in length. The axon carries nerve signals away from the soma (and carry some types of information in the other direction also). Many neurons have only one axon, but this axon may - and usually will - undergo extensive branching, enabling communication with many target cells. The part of the axon where it emerges from the soma is called the 'axon hillock'. Besides being an anatomical structure, the axon hillock is also the part of the neuron that has the greatest density of voltage-dependent sodium channels. Thus it has the most hyperpolarized action potential threshold of any part of the neuron. In other words, it is the most easily-excited part of the neuron, and thus serves as the spike initiation zone for the axon. While the axon and axon hillock are generally considered places of information outflow, this region can receive input from other neurons as well.

    The axon terminal is a specialized structure at the end of the axon that is used to release neurotransmitter and communicate with target neurons.

    Although the canonical view of the neuron attributes dedicated functions to its various anatomical components, dendrites and axons very often act contrary to their so-called main function.

    Axons and dendrites in the central nervous system are typically only about a micrometer thick, while some in the peripheral nervous system are much thicker. The soma is usually about 10-25 micrometers in diameter and often is not much larger than the cell nucleus it contains. The longest axon of a human motoneuron can be over a meter long, reaching from the base of the spine to the toes, while giraffes have single axons running along the whole length of their necks, several meters in length. Much of what we know about axonal function comes from studying the squid giant axon, an ideal experimental preparation because of its relatively immense size (0.5-1 millimeters thick, several centimeters long).

    Classes

    Functional classification

    Afferent neurons convey information from tissues and organs into the central nervous system.

    Efferent neurons transmit signals from the central nervous system to the effector cells and are sometimes called motor neurons.

    Interneurons connect neurons within specific regions of the central nervous system.

    Afferent and efferent can also refer to neurons which convey information from one region of the brain to another.

    Structural classification Most neurons can be anatomically characterized as:

    Unipolar or pseudounipolar - dendrite and axon emerging from the same process.

    Bipolar - single axon and single dendrite on opposite ends of the soma.

    Multipolar - more than two dendrites.

    Golgi I - neurons with long-projecting axonal processes.

    Golgi II - neurons whose axonal process projects locally.

    Connectivity

    Neurons communicate with one another via synapses, where the axon terminal of one cell impinges upon a dendrite or soma of another (or less commonly to an axon). Neurons such as Purkinje cells in the cerebellum can have over 1000 dendritic branches, making connections with tens of thousands of other cells; other neurons, such as the magnocellular neurons of the supraoptic nucleus, have only one or two dendrites, each of which receives thousands of synapses. Synapses can be excitatory or inhibitory and will either increase or decrease activity in the target neuron. Some neurons also communicate via electrical synapses, which are direct, electrically-conductive junctions between cells.

    In a chemical synapse, the process of synaptic transmission is as follows: when an action potential reaches the axon terminal, it opens voltage-gated calcium channels, allowing calcium ions to enter the terminal. Calcium causes synaptic vesicles filled with neurotransmitter molecules to fuse with the membrane, releasing their contents into the synaptic cleft. The neurotransmitters diffuse across the synaptic cleft and activate receptors on the postsynaptic neuron.

    The human brain has a huge number of synapses. Each of 100 billion neurons has on average 7,000 synaptic connections to other neurons. Most authorities estimate that the brain of a three-year-old child has about 1,000 trillion synapses. This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 100 to 500 trillion synapses. [2]

    Adaptations to carrying action potentials

    The cell membrane in the axon and soma contain voltage-gated ion channels which allow the neuron to generate and propagate an electrical impulse (an action potential). Substantial early knowledge of neuron electrical activity came from experiments with squid giant axons. In 1937, John Zachary Young suggested that the giant squid axon might be used to better understand neurons [3]. As they are much larger than human neurons, but similar in nature, it was easier to study them with the technology of that time. By inserting electrodes into the giant squid axons, accurate measurements could be made of the membrane potential.

    Electrical activity can be produced in neurons by a number of stimuli. Pressure, stretch, chemical transmitters, and electrical current passing across the nerve membrane as a result of a difference in voltage can all initiate nerve activity [4].


  • The narrow cross-section of axons lessens the metabolic expense of carrying action potentials, but thicker axons convey impulses more rapidly. To minimize metabolic expense while maintaining rapid conduction, many neurons have insulating sheaths of myelin around their axons. The sheaths are formed by glial cells: oligodendrocytes in the central nervous system and Schwann cells in the peripheral nervous system. The sheath enables action potentials to travel faster than in unmyelinated axons of the same diameter, whilst using less energy. The myelin sheath in peripheral nerves normally runs along the axon in sections about 1 mm long, punctuated by unsheathed nodes of Ranvier which contain a high density of voltage-gated ion channels. Multiple sclerosis is a neurological disorder that results from abnormal demyelination of peripheral nerves. Neurons with demyelinated axons do not conduct electrical signals properly.

    Challenges to the neuron doctrine

    The neuron doctrine is a central tenet of modern neuroscience, but recent studies suggest that this doctrine needs to be revised.

    First, electrical synapses are more common in the central nervous system than previously thought. Thus, rather than functioning as individual units, in some parts of the brain large ensembles of neurons may be active simultaneously to process neural information.

    Second, dendrites, like axons, also have voltage-gated ion channels and can generate electrical potentials that carry information to and from the soma. This challenges the view that dendrites are simply passive recipients of information and axons the sole transmitters. It also suggests that the neuron is not simply active as a single element, but that complex computations can occur within a single neuron.

    Third, the role of glia in processing neural information has begun to be appreciated. Neurons and glia make up the two chief cell types of the central nervous system. There are far more glial cells than neurons: glia outnumber neurons by as many as 10:1. Recent experimental results have suggested that glia play a vital role in information processing. [citations?]

    Finally, recent research has challenged the historical view that neurogenesis, or the generation of new neurons, does not occur in adult mammalian brains. It is now known that the adult brain continuously creates new neurons in the hippocampus and in an area contributing to the olfactory bulb. This research has shown that neurogenesis is environment-dependent (eg. exercise, diet, interactive surroundings), age-related, upregulated by a number of growth factors, and halted by survival-type stress factors. [5] [6]

    Neurons in the brain

    The number of neurons in the brain varies dramatically from species to species. The human brain has about 100 billion (10^11) neurons and 100 trillion (10^14) synapses. By contrast, the nematode worm (Caenorhabditis elegans) has 302 neurons. Scientists have mapped all of the nematode's neurons. As a result, such worms are ideal candidates for neurobiological experiments and tests. Many properties of neurons, from the type of neurotransmitters used to ion channel composition, are maintained across species, allowing scientists to study processes occurring in more complex organisms in much simpler experimental systems.

    See also

    Artificial neuron
    Neural oscillations
    Mirror neuron
    Neuroscience
    Neural network
    Spindle neuron

    Sources

    Kandel E.R., Schwartz, J.H., Jessell, T.M. 2000. Principles of Neural Science, 4th ed., McGraw-Hill, New York.

    Bullock, T.H., Bennett, M.V.L., Johnston, D., Josephson, R., Marder, E., Fields R.D. 2005. The Neuron Doctrine, Redux, Science, V.310, p. 791-793.

    Ramón y Cajal, S. 1933 Histology, 10th ed., Wood, Baltimore.

    Peters, A., Palay, S.L., Webster, H, D., 1991 The Fine Structure of the Nervous System, 3rd ed., Oxford, New York.

    External links

    Cell Centered Database (UC San Diego): images of neurons.

    High Resolution Neuroanatomical Images of Primate and Non-Primate Brains.

    Retrieved from "http://en.wikipedia.org/wiki/Neuron"


  • Artificial neuron (From Wikipedia, the free encyclopedia)

    An artificial neuron (also called a "node" or "neuron") is a basic unit in an artificial neural network. Artificial neurons are simulations of biological neurons, and they are typically functions from many dimensions to one dimension. They receive one or more inputs and sum them to produce an output. Usually the sums of each node are weighted, and the sum is passed through a non-linear function known as an activation or transfer function. The canonical form of transfer functions is the sigmoid, but they may also take the form of other non-linear functions, piecewise linear functions, or step functions. Generally, transfer functions are monotonically increasing.

    Contents

    1 Basic structure
    2 History
    3 Types of transfer functions
        3.1 Step function
        3.2 Sigmoid
    4 See also
    5 Bibliography

    Basic structure

    For a given artificial neuron, let there be m inputs with signals x1 through xm and weights w1 through wm.

    The output of neuron k is:

    yk = φ( wk1·x1 + wk2·x2 + ... + wkm·xm ) = φ( Σj wkj·xj )

    where φ (phi) is the transfer function.


  • The output propagates to the next layer (through a weighted synapse) or finally exits the system as part or all of the output.
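    To make the formula above concrete, here is a minimal sketch of a single artificial neuron in Python (the sigmoid transfer function and the numeric weights/inputs are illustrative assumptions, not values taken from the course):

import math

def sigmoid(x):
    # one possible transfer function phi
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(weights, inputs, phi=sigmoid):
    # y = phi( sum_j w_j * x_j )
    activation = sum(w * x for w, x in zip(weights, inputs))
    return phi(activation)

# a neuron with m = 3 inputs (made-up values)
print(neuron_output([0.5, -1.0, 2.0], [1.0, 0.3, 0.7]))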

    History

    The original artificial neuron is the Threshold Logic Unit first proposed by Warren McCulloch and Walter Pitts in 1943. As a transfer function, it employs a threshold or step function taking on the values 1 or 0 only.

    See article on Perceptron for more details

    Types of transfer functions

    The transfer function of a neuron is chosen to have a number of properties which either enhance or simplify the network containing the neuron. Crucially, for instance, any multi-layer perceptron using a linear transfer function has an equivalent single-layer network; a non-linear function is therefore necessary to gain the advantages of a multi-layer network.

    Step function

    The output y of this transfer function is binary, depending on whether the input meets a specified threshold θ. The "signal" is sent, i.e. the output is set to one, if the activation meets the threshold.

    Sigmoid

    A fairly simple non-linear function, the sigmoid also has an easily calculated derivative, which is used when calculating the weight updates in the network. It thus makes the network more easily manipulable mathematically, and was attractive to early computer scientists who needed to minimise the computational load of their simulations.

    See: Sigmoid function
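    As a small sketch (not part of the original text): the identity s'(x) = s(x)·(1 - s(x)) is what makes the sigmoid's derivative so cheap to compute during the weight updates.

import math

def s(x):
    return 1.0 / (1.0 + math.exp(-x))

def s_prime(x):
    # derivative of the sigmoid, reused when computing weight updates
    y = s(x)
    return y * (1.0 - y)

print(s(0.0), s_prime(0.0))   # 0.5 0.25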


  • Bibliography

    McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 7:115 - 133.

    Retrieved from "http://en.wikipedia.org/wiki/Artificial_neuron"Category: Neural networks

    Important note: many functions other than the sigmoid can be used. The sigmoid is just the most popular one, and it is sometimes not the best choice.


  • Mathematical model and properties of Artificial Neural Networks

    Mathematical model

    Imagine a simple 2D classification problem with 2 classes.

    A multidimensional (2D) sigmoid (i.e. a neuron with 2 inputs) looks like this.


    So one single neuron (W0 = 1 and W1 = 1) can perform a linear separation with a smooth transition ("fading"), which makes it able to handle uncertain decisions.

    Here s(x) = 1/(1+exp(-x)).

    If output is close to 0 then the presented input data is believed to belong to class 0 (green x).

    If output is close to 1 then the presented input data is believed to belong to class 1 (red +).

    If output is around 0.5 then you know that the decision will be uncertain.

    [Figure: 3D surface plot of z = s( Σi wi·xi ) over the (x, y) input plane]

  • One neuron can't do much, but several neurons (with an appropriate learning phase) can do a lot more...

    Using two neurons (on the same layer) can produce more complex « landscapes ».

    Example :

    The corresponding parameters are W0 = 1, W1 = 1 for the left neuron, W0 = 2, W1 = 0 for the right neuron, and W0 = 1, W1 = 1 for the output neuron.
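    A minimal sketch of how these three neurons combine, assuming the weights listed above and no bias terms (my reading of the example, not code from the course):

import math

def s(x):
    return 1.0 / (1.0 + math.exp(-x))

def landscape(x, y):
    # left hidden neuron: W0 = 1, W1 = 1; right hidden neuron: W0 = 2, W1 = 0
    left = s(1.0 * x + 1.0 * y)
    right = s(2.0 * x + 0.0 * y)
    # output neuron: W0 = 1, W1 = 1
    return s(1.0 * left + 1.0 * right)

for x, y in [(-3.0, -3.0), (0.0, 0.0), (3.0, 3.0)]:
    print(x, y, landscape(x, y))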

    [Figure: network of two hidden neurons feeding one output neuron, each computing s( Σi wi·xi ), plotted as a surface over the (x, y) input plane]

  • Properties

    Universal function approximation

    It has been proved that every continuous function f : R^n -> R^p can be approximated by a neural network.

    A neural network can be seen as a function decomposition over a basis of sigmoids (or other functions), just as Fourier analysis is a function decomposition over sine and cosine functions.

    Since we view our problems as function approximation and our model is a continuous function, it is important to ensure that the way we present data to the neural network is itself continuous.

    For a NN the continuity constraint can be expressed this way: if two input patterns are close, the corresponding NN output patterns should be close too. If this is not the case, the learning process will be more difficult and the NN performance will probably be low.

    Example approximating the XOR function :

    A B A XOR B

    0 0 0

    1 0 1

    0 1 1

    1 1 0

    The input patterns (1,0) and (1,1) are close, but the corresponding outputs (1 and 0) are not. This makes it a classic hard learning dataset.
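    As an aside, a hand-crafted 2-2-1 network with step-function neurons is enough to represent XOR, even though a single neuron cannot; the weights below are a classic illustrative choice, not taken from the course:

def step(x, threshold):
    return 1 if x >= threshold else 0

def xor_net(a, b):
    h1 = step(a + b, 0.5)      # fires on "a OR b"
    h2 = step(a + b, 1.5)      # fires on "a AND b"
    return step(h1 - h2, 0.5)  # fires on "OR but not AND", i.e. XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))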


    Note: the fact that neural networks are universal function approximators does not imply that they can solve any classification problem.

    For example, this dataset shows a problem that cannot be totally solved:

    The red (+) and green (x) classes overlap each other (when considering only the two parameters shown here), so the neural network cannot achieve a 100 % correct classification. The overlapping regions will lead to uncertain answers/outputs. If the network is properly trained, the output can reflect a measure of the uncertainty (a probability approximation), which is still valuable information even if you cannot make a clear-cut decision.

    Generalisation

    When using neural networks for function approximation we implicitly use another neural property: generalisation. The network learns/approximates a (mathematical) function while knowing only a limited number of its values.

    Example: a NN is used to approximate the sine function. The training dataset is a series of pairs (x, sin(x)) where x is sampled every π/20.


    The network is said to have a good generalisation behaviour when it performs well on values in between those used in training.

    Example: the network trained above is tested on the same range but with a π/80 sampling rate.
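    A sketch of this experiment, assuming scikit-learn's MLPRegressor as one possible library (the network size, solver and iteration count are arbitrary choices):

import numpy as np
from sklearn.neural_network import MLPRegressor

# training pairs (x, sin(x)) sampled every pi/20; test grid sampled every pi/80
x_train = np.arange(0.0, 2 * np.pi, np.pi / 20).reshape(-1, 1)
x_test = np.arange(0.0, 2 * np.pi, np.pi / 80).reshape(-1, 1)

net = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                   solver="lbfgs", max_iter=5000)
net.fit(x_train, np.sin(x_train).ravel())

# generalisation: the error is measured on points never seen during training
test_error = np.mean((net.predict(x_test) - np.sin(x_test).ravel()) ** 2)
print("mean squared error on the pi/80 grid:", test_error)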


    This property comes from the fact that we are approximating continuous functions with a composition of continuous functions.

    Robustness

    Since the learning is statistical, ANNs are tolerant when facing noisy data.

    Adding noise to the same sine function used above does not disturb the training much.
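    A sketch of the same experiment with noisy training targets (the 0.1 noise level is an arbitrary assumption):

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = np.arange(0.0, 2 * np.pi, np.pi / 20).reshape(-1, 1)
y_noisy = np.sin(x_train).ravel() + rng.normal(0.0, 0.1, size=len(x_train))

net = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                   solver="lbfgs", max_iter=5000)
net.fit(x_train, y_noisy)

# compare against the clean sine: the statistical fit averages much of the noise out
x_test = np.arange(0.0, 2 * np.pi, np.pi / 80).reshape(-1, 1)
print(np.mean((net.predict(x_test) - np.sin(x_test).ravel()) ** 2))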


  • Training and using (simplified)

    Training a NN

    Training a NN is the process of estimating good parameters (Wi) with respect to a training dataset. "Good parameters" means parameters that lead to a minimum of errors, the error being simply the difference between the expected output and the actual NN output. Training is mainly an optimisation problem: finding the best parameters to lower the error as far as possible. Many optimisation techniques exist, and more or less all of them are applicable here. The most common is gradient descent. The technique is to iteratively follow the direction given by the error gradient: for each input pattern we can compute the error derivative, which indicates how to modify each NN parameter in order to reduce the error. This process may take some CPU power if the dataset, the network, or both are large.

    Imagine that the network has only two parameters (Wi): training the network would then be equivalent to finding the minimum of a surface (the Z axis corresponding to the mean error of the network, W0 and W1 corresponding to x and y).

    Generally speaking we talk about an error hypersurface, because the number of parameters is usually far greater than 2.
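    A minimal sketch of gradient descent on a toy two-parameter error surface (the quadratic error function, starting point and step size are made up for illustration):

def error(w0, w1):
    # toy error surface with a single minimum at (1.0, -2.0)
    return (w0 - 1.0) ** 2 + (w1 + 2.0) ** 2

def gradient(w0, w1):
    # partial derivatives dE/dw0 and dE/dw1
    return 2.0 * (w0 - 1.0), 2.0 * (w1 + 2.0)

w0, w1 = 5.0, 5.0      # arbitrary starting point
learning_rate = 0.1
for _ in range(200):
    g0, g1 = gradient(w0, w1)
    w0 -= learning_rate * g0   # step against the gradient
    w1 -= learning_rate * g1

print(w0, w1, error(w0, w1))   # close to (1.0, -2.0, 0.0)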


  • Using a NN

    Using a NN is simply propagating the input values towards the output; it is generally a low-CPU stage (just a few function evaluations). This is one of the advantages of NNs: training may need some time and CPU power, but using the trained network is quick and easy.

    Some postprocessing is sometimes needed in real-life applications: a typical classification problem needs a final decision, whereas the NN only outputs a numerical vector (in [-1,1]). Classical algorithms are then used to find the best fitting class, and the distance to the second-best class can be a good confidence index.
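    A sketch of that kind of postprocessing: pick the class with the highest output and use the gap to the runner-up as a confidence index (the output vector below is made up):

def decide(outputs):
    # outputs: one NN output value per class
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    best, second = ranked[0], ranked[1]
    confidence = outputs[best] - outputs[second]   # margin to the second-best class
    return best, confidence

print(decide([0.10, 0.85, 0.78]))   # class 1 wins, but with a small margin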

    A NN is rarely the complete solution to a problem, but it can be a decisive part of many solutions.

    Example in computer gaming: making a good AI gamer/bot for an action game is hard and complex, because no one can really express mathematically what a good strategy is; everything depends on the human opponent's gaming style. Directly training a NN to decide each action is a far too complex task, because you would have to be able to label each action/decision as good or not good (since we are doing supervised training). The quality of each decision (leading to winning or losing the game) is simply inaccessible.

    Another way to look at building a good AI bot is to train a NN that simply predicts the position and/or action of the human opponent. This is far easier, because you have this information (a simple game recording is enough). You may then feed the information predicted by the NN to an expert system. Programming a decent AI bot with an expert system that knows the next moves of the human opponent is rather simple...

    Here are interesting demonstrations: OCR demo: http://www.sund.de/netze/applets/BPN/bpn2/ochre.html

    function approximation : http://neuron.eng.wayne.edu/bpFunctionApprox/bpFunctionApprox.html

    Applications

    There is a very wide variety of applications of neural techniques, and there are certainly even more to discover. Here is a non-exhaustive list:

    Classification (includes biometry)

    generalisation, interpolation, prediction (credit allocation, energy consumption, ...)

    prediction/forecasting (over time or otherwise)

    gaming


  • Deeper insight

    Foreword on classification and prediction systems

    The field is quite large, so it is very important to use the correct words to identify each kind of application.

    Classification is the process of attributing a class to an unknown entry. The system measures features of the entry and then decides which class it belongs to.

    A classical example is biometrics: a person is measured through a biometric system (voice sample, fingerprint, iris, ...); these measures are then compared to stored models and the program decides whether to grant access or not. This particular case of classification is called verification: the system verifies the claim of the user. The user claims an identity, and the system verifies it. There are two classes, positive recognition and negative recognition, which makes it a two-class problem. For real-life applications it is strongly advised to extend it to at least a 3-class problem:

    access granted

    access refused

    can't tell (restart the measures, a limited number of times)

    Identification is a different kind of classification problem: you have no prior information about the input data, and you have to attribute a class to it. One good example is speaker identification, where the problem is to identify the voice of a person among a stored collection of speaker models.

    It is sometimes important to distinguish two subkinds :

    when the user knows that he is being measured and is willing: a collaborative context;

    when the user does not know, or even does not want, to be classified.

    A prediction/forecasting task is like weather forecasting: predicting the future of a value. Beware, this term is often misused. We are trying to make programs that predict things, but in this field "prediction" may denote quite different operations. In biometric applications, for instance, some people talk about prediction for the decision process, which is not really a prediction, simply because there is no time dependence. So the term prediction is used nearly everywhere a program takes a decision.

    Since one of the goals of an engineer is to turn software functionality into business, it is very important to be able to measure the effectiveness of a prediction system in order to tailor a commercial system around it.

    The quality of a program as a two-class (positive|negative) predictor is estimated through these quantities and definitions:

    true positives: positive decisions that are confirmed


    true negatives: negative decisions that are confirmed

    false positives: positive decisions that should have been negative

    false negatives: negative decisions that should have been positive

    Imagine you are building a biometric system to access banking data: the cost of a false positive error is not the same as the cost of a false negative error.

    In the first case you give access to a person who should not have it; in the second case you deny access to a genuine user. Remember that training algorithms are more or less optimisation problems; it is therefore very important to adapt the cost function of the training to handle these different kinds of errors...

    Extending this concept to an N > 2 class classification problem leads to the concept of the confusion matrix.

    A confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is typically called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is that it is easy to see if the system is confusing two classes (i.e. commonly mislabelling one as an other).

    In the example confusion matrix below, of the 8 actual cats, the system predicted that three were dogs, and of the six dogs it predicted that one was a rabbit and two were cats. We can see from the matrix that the system in question has trouble distinguishing between cats and dogs, but can make the distinction between rabbits and other types of animals pretty well.

    Example of a confusion matrix (rows: actual class, columns: predicted class):

              Cat   Dog   Rabbit
    Cat        5     3      0
    Dog        2     3      1
    Rabbit     0     2     11
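    A sketch of how such a matrix can be built from (actual, predicted) pairs; the labels and predictions below are hypothetical:

from collections import Counter

def confusion_matrix(actual, predicted, labels):
    counts = Counter(zip(actual, predicted))
    # rows: actual class, columns: predicted class
    return [[counts[(a, p)] for p in labels] for a in labels]

labels = ["cat", "dog", "rabbit"]
actual = ["cat", "cat", "dog", "rabbit", "dog"]
predicted = ["cat", "dog", "dog", "rabbit", "cat"]
for label, row in zip(labels, confusion_matrix(actual, predicted, labels)):
    print(label, row)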

    Threshold finding

    A NN outputs fuzzy values, but many real-life problems need discrete answers: yes or no, and nothing else. The basic way to obtain them is to threshold the NN output; all the difficulty lies in the choice of that threshold. Let us look at the process for a 2-class classification problem. One common way to choose the threshold is to draw the score histograms of both classes on the same graph.


    This figure shows that if the threshold is set at 3 (score s < 3 => class 1; s >= 3 => class 2) we expect:

    virtually no false positives on class 1

    some false negatives on class 1 (the ones that scored >= 3)

    class 2 should have no false negatives

    class 2 will have some false positives

    Moving the threshold allows different compromises in the classifier's accuracy.

    Error statistics should be computed for each possible threshold: this leads to the ROC curve (Receiver Operating Characteristic).
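    A sketch of such per-threshold error statistics, using the convention of the figure (score < threshold => class 1); the score lists are made up:

def error_rates(scores_class1, scores_class2, threshold):
    fn1 = sum(1 for s in scores_class1 if s >= threshold)  # class 1 mislabelled as class 2
    fp1 = sum(1 for s in scores_class2 if s < threshold)   # class 2 mislabelled as class 1
    return fn1 / len(scores_class1), fp1 / len(scores_class2)

scores_class1 = [0.5, 1.2, 2.0, 2.8, 3.4]   # hypothetical scores of class 1 samples
scores_class2 = [3.1, 4.0, 4.5, 5.2, 6.0]   # hypothetical scores of class 2 samples
for threshold in (2.0, 3.0, 4.0):
    print(threshold, error_rates(scores_class1, scores_class2, threshold))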


  • Receiver operating characteristic (From Wikipedia, the free encyclopedia)

    In signal detection theory, a receiver operating characteristic (ROC), also receiver operating curve, is a graphical plot of the sensitivity vs. (1 - specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting the fraction of true positives (TP) vs. the fraction of false positives (FP). The usage receiver operator characteristic is also common.

    ROC curves are used to evaluate the results of a prediction and were first employed in the study of discriminator systems for the detection of radio signals in the presence of noise in the 1940s, following the attack on Pearl Harbor. The initial research was motivated by the desire to determine how the US RADAR "receiver operators" had missed the Japanese aircraft.

    In the 1950s they began to be used in psychophysics, to assess human (and occasionally animal) detection of weak signals. They also proved to be useful for the evaluation of machine learning results, such as the evaluation of Internet search engines. They are also used extensively in epidemiology and medical research and are frequently mentioned in conjunction with evidence-based medicine.

    The best possible prediction method would yield a graph that was a point in the upper left corner of the ROC space, i.e. 100% sensitivity (all true positives are found) and 100% specificity (no false positives are found). A completely random predictor would give a straight line at an angle of 45 degrees from the horizontal, from bottom left to top right: this is because, as the threshold is raised, equal numbers of true and false positives would be let in. Results below this no-discrimination line would suggest a detector that gave wrong results consistently, and could therefore be simply used to make a detector that gave useful results by inverting its decisions.

    How a ROC curve can be interpreted

    Sometimes, the ROC is used to generate a summary statistic. Common versions are:

    p. 29/60 Wassner Hubert [email protected]

    the intercept of the ROC curve with the line at 90 degrees to the no-discrimination line

    the area between the ROC curve and the no-discrimination line

    the area under the ROC curve, often called AUC

    d' (pronounced "d-prime"), the distance between the mean of the distribution of activity in the system under noise-alone conditions and its distribution under signal-plus-noise conditions, divided by their standard deviation, under the assumption that both these distributions are normal with the same standard deviation. Under these assumptions, it can be proved that the shape of the ROC depends only on d'.

    [Figure: ROC curves of three epitope predictors]

    However, any attempt to summarize the ROC curve into a single number loses information about the pattern of tradeoffs of the particular discriminator algorithm.

    The machine learning community most often uses the ROC AUC statistic. This measure can be interpreted as the probability that when we randomly pick one positive and one negative example, the classifier will assign a higher score to the positive example than to the negative. In engineering, the area between the ROC curve and the no-discrimination line is often preferred, because of its useful mathematical properties as a non-parametric statistic. This area is often simply known as the discrimination. In psychophysics, d' is the most commonly used measure.

    The illustration to the right shows the use of ROC graphs for the discrimination between the quality of different epitope predicting algorithms. If you wish to discover at least 60% of the epitopes in a virus protein, you can read out of the graph that about 1/3 of the output would be falsely marked as an epitope. The information that is not visible in this graph is that the person that uses the algorithms knows what threshold settings give a certain point in the ROC graph.

    Retrieved from "http://en.wikipedia.org/wiki/Receiver_operating_characteristic"Category: Detection theory

    Demos:

    A very nice applet showing the dynamics of ROC curves: http://www.anaesthetist.com/mnm/stats/roc/

    ROC analysis through the web: http://www.rad.jhmi.edu/jeng/javarad/roc/main.html

    Testing precautions

    The above testing definitions and statistics must be computed on suitable datasets. Since we use the training data to estimate the model parameters, the network will behave unrealistically well on that same data. It is very important to test the network on data it has never seen before: this is the only way to measure the real generalisation capability of the neural network.

    The art of splitting datasets

    We are basically doing statistics, both when training and when testing networks, so we want the biggest datasets possible. For most real-life problems we have limited access to data, and moreover the training and testing datasets must be disjoint; this is the first constraint to take into account. The second is to have good coverage of real data in both the training and testing sets, and for all classes.

    Randomly shuffling the data before splitting it into train and test datasets is a good basic solution. Sometimes it is not a good choice, when there are unknown statistical links between samples; some expert information might be needed at this stage.
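    A minimal sketch of such a shuffle-then-split (the 80/20 ratio and the fixed seed are arbitrary choices):

import numpy as np

def shuffle_split(data, labels, train_fraction=0.8, seed=0):
    # random shuffle before splitting into train and test sets
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    cut = int(train_fraction * len(data))
    train_idx, test_idx = order[:cut], order[cut:]
    return (data[train_idx], labels[train_idx]), (data[test_idx], labels[test_idx])

data = np.arange(20).reshape(10, 2)    # 10 hypothetical samples with 2 features
labels = np.arange(10) % 2
(train_x, train_y), (test_x, test_y) = shuffle_split(data, labels)
print(len(train_x), "training samples,", len(test_x), "test samples")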

    Datasets cost

    It may sound strange, but training datasets are often quite small and proprietary. This is simply because they are expensive to build: most of the time the output (called the label or annotation) is determined by an expert, and you need a big dataset to properly train and test your ANN. (Note: this is true for any other statistical model, not only ANNs.)

    It is very important to estimate the cost of dataset collection and annotation in this kind of project.

    When little data is available, one option is the jack-knife technique.

    Jackknifed statistics are created by systematically dropping out subsets of data one at a time and assessing the resulting variation in the studied parameter.


    Jack-knife technique: the whole dataset is split n times into one test subset and (n - 1) training subsets, giving n trainings (each on n - 1 examples) and n tests (each on a different, but similarly trained, model).
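    A sketch of the splitting loop in its leave-one-out form (the dataset below is a placeholder):

def jackknife_splits(dataset):
    # n splits: train on (n - 1) examples, test on the one left out
    for i in range(len(dataset)):
        test_example = dataset[i]
        train_examples = dataset[:i] + dataset[i + 1:]
        yield train_examples, test_example

dataset = ["ex0", "ex1", "ex2", "ex3"]   # hypothetical labelled examples
for train, test in jackknife_splits(dataset):
    print(len(train), "training examples, testing on", test)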

    In some cases the annotation process is the most expensive part. Bootstrap learning can take advantage of this situation: one trains a NN with the labelled data and uses the trained network to ease the annotation of more data. Doing so incrementally grows the annotated database and raises the NN accuracy.

    Are we doing that good ?

    To know whether the NN is working well, one must compare it to the minimal classifier (or random predictor), which only uses the a priori knowledge of the class probabilities.

    Example: a classification problem with 2 classes distributed 10 % / 90 % can be approached pretty well by simply choosing the class at random according to the a priori distribution.
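    A quick check of that baseline, using the 10 % / 90 % priors of the example:

p = [0.10, 0.90]   # a priori class probabilities

# predictor drawing its answer from the priors: correct when the draw matches the true class
random_predictor_accuracy = sum(pi * pi for pi in p)   # 0.1*0.1 + 0.9*0.9 = 0.82

# even simpler baseline: always answer the majority class
majority_accuracy = max(p)                             # 0.90

print(random_predictor_accuracy, majority_accuracy)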

    If your NN can't do better than the minimal classifier:

    Something is going wrong somewhere in your data processing. (bug?)

    Your problem may be harder than you thought. Looking at two-dimensional projections of the training data can give hints about the classification difficulty. For example, the following 2D projection shows a very hard (impossible?) classification task. More separated data points must be found, at least in


    some 2D projections, to have any hope of being able to train a classifier on this problem...

    Note: you should be suspicious if your NN achieves 100 % correct classification. Make sure the problem really is that easy. Some bugs can lead to an apparent 100 % correctness, and the optimistic programmer may fall into pitfalls if he does not discover the truth. This happens more often than you may think... ;-)

    Training, testing and using (the real stuff)

    Preparing the datasets

    Depending on the activation function used by the input neurons, the input data must lie in a given range ([0,1], [-1,+1], ...). Real-life problems rarely fit directly into this format, so a normalisation function must be used. This function must be chosen carefully, since it implicitly defines the sensitivity of the neurons.

    These are the characteristics needed for normalisation:

    The input normalisation must preserve the dynamics of the data.

    The output normalisation must enhance the error measurement, in order to ease the gradient descent, and it should be easy to transform back into real-world data.


    Warning: normalising by a simple homothety (linear normalisation) may not be a good solution.

    Example: the following histogram shows a dataset where almost all values are in [0, 11], but a very few are bigger than 25. A simple linear normalisation into [0, 1] would squeeze most of the data's dynamic range into [0, ~0.25]. This may not suit the input neuron range, effectively blinding the NN.

    The solution is to study the statistics of every input variable in order to find a good scaling function. Two main solutions here (a short sketch follows this list):

    building a non-linear normalisation function which better respects the data dynamics;

    using a linear normalisation with hand-/expert-selected extremum values (this works well if the extreme cases are easy classification cases).
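    A sketch of both options on a made-up skewed variable (the clipping bounds and the choice of a logarithmic transform are illustrative assumptions, not recommendations from the course):

import numpy as np

values = np.array([0.5, 3.0, 7.0, 10.5, 11.0, 26.0, 40.0])   # hypothetical skewed input

# option 2: linear normalisation with expert-chosen extrema, outliers clipped to [0, 1]
lo, hi = 0.0, 11.0
linear = np.clip((values - lo) / (hi - lo), 0.0, 1.0)

# option 1: a non-linear (here logarithmic) normalisation that keeps more of the dynamics
nonlinear = np.log1p(values) / np.log1p(values.max())

print(linear)
print(nonlinear)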

    Supervised learning

    Training a feed-forward multi-layer neural network means finding the parameter values Wi for which the error is minimal over a given dataset. This is an optimisation problem. A common way to solve this kind of problem is gradient-based methods. The main idea is to use the information given by the derivative of the error function.

    Gradient descent (From Wikipedia, the free encyclopedia)

    Gradient descent is an optimization algorithm that approaches a local minimum of a function by taking steps proportional to the negative of the gradient (or the approximate gradient) of the function at the current point. If instead one takes steps proportional to the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.

    Gradient descent is also known as steepest descent, or the method of steepest descent. When known as the latter, gradient descent should not be confused with the method of steepest descent for approximating integrals.


    Description of the method

    We describe gradient descent here in terms of its equivalent (but opposite) cousin, gradient ascent. Gradient ascent is based on the observation that if the real-valued function F(x) is defined and differentiable in a neighborhood of a point a, then F(x) increases fastest if one goes from a in the direction of the gradient of F at a, ∇F(a). It follows that, if

    b = a + γ ∇F(a)

    for γ > 0 a small enough number, then F(a) ≤ F(b). With this observation in mind, one starts with a guess x0 for a local maximum of F, and considers the sequence x0, x1, x2, ... such that

    x_{n+1} = x_n + γ_n ∇F(x_n),  n ≥ 0.

    We have F(x0) ≤ F(x1) ≤ F(x2) ≤ ..., so hopefully the sequence (x_n) converges to the desired local maximum. Note that the value of the step size γ is allowed to change at every iteration.

    Let us illustrate this process in the picture below. Here F is assumed to be defined on the plane, and that its graph looks like a hill. The blue curves are the contour lines, that is, the regions on which the value of F is constant. A red arrow originating at a point shows the direction of the gradient at that point. Note that the gradient at a point is perpendicular to the contour line going through that point. We see that gradient ascent leads us to the top of the hill, that is, to the point where the value of the function F is largest.


    To have gradient descent go towards a local minimum, one needs to replace γ with -γ.


    [Figure: the gradient descent method applied to an arbitrary function, shown as contour lines and as a 3D view]

    Comments

    Note that gradient descent works in spaces of any number of dimensions, even in infinite-dimensional ones.

    Two weaknesses of gradient descent are:

    1. The algorithm can take many iterations to converge towards a local maximum/minimum, if the curvature in different directions is very different.

    2. Finding the optimal γ per step can be time-consuming. Conversely, using a fixed γ can yield poor results. Methods based on Newton's method and inversion of the Hessian using conjugate gradient techniques are often a better alternative.

    A more powerful algorithm is given by the BFGS method, which consists in calculating at every step a matrix by which the gradient vector is multiplied to go in a "better" direction, combined with a more sophisticated line search algorithm, to find the "best" value of γ.

    See also

    Stochastic gradient descent
    Newton's method
    Optimization
    Line search
    Delta rule

    Retrieved from "http://en.wikipedia.org/wiki/Gradient_descent"


    Backpropagation (From Wikipedia, the free encyclopedia)

    Backpropagation is a supervised learning technique used for training artificial neural networks. It was first described by Paul Werbos in 1974, and further developed by David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams in 1986.

    It is most useful for feed-forward networks (networks that have no feedback, or simply, that have no connections that loop). The term is an abbreviation for "backwards propagation of errors". Backpropagation requires that the transfer function used by the artificial neurons (or "nodes") be differentiable.

    The summary of the technique is as follows:

    1. Present a training sample to the neural network.

    2. Compare the network's output to the desired output from that sample. Calculate the error in each output neuron.

    3. For each neuron, calculate what the output should have been, and a scaling factor, i.e. how much lower or higher the output must be adjusted to match the desired output. This is the local error.

    4. Adjust the weights of each neuron to lower the local error.

    5. Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.

    6. Repeat the steps above on the neurons at the previous level, using each one's "blame" as its error.

    As the algorithm's name implies, the errors (and therefore the learning) propagate backwards from the output nodes to the inner nodes. So technically speaking, backpropagation is used to calculate the gradient of the error of the network with respect to the network's modifiable weights. This gradient is almost always then used in a simple stochastic gradient descent algorithm to find weights that minimize the error. Often the term "backpropagation" is used in a more general sense, to refer to the entire procedure encompassing both the calculation of the gradient and its use in stochastic gradient descent. Backpropagation usually allows quick convergence on satisfactory local minima for error in the kind of networks to which it is suited.

    It is important to note that backprop networks are necessarily multilayer (usually with one input, one hidden, and one output layer). In order for the hidden layer to serve any useful function, multilayer networks must have non-linear activation functions for the multiple layers. Non-linear activation functions that are commonly used include the logistic function, the softmax function, and the Gaussian function.

    The backpropagation algorithm for calculating a gradient has been rediscovered a number of times, and is a special case of a more general technique called automatic differentiation in the reverse accumulation mode.
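    As a rough illustration of steps 1 to 6 above, here is a minimal sketch in Python/numpy of backpropagation with per-sample (stochastic) gradient descent on a tiny 2-4-1 network learning XOR. The layer sizes, learning rate and number of epochs are illustrative choices, not values taken from this course:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR training samples: inputs X and desired outputs T
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# Weight matrices (the last row of each acts as the bias weights)
W1 = rng.normal(scale=0.5, size=(3, 4))   # 2 inputs + bias -> 4 hidden units
W2 = rng.normal(scale=0.5, size=(5, 1))   # 4 hidden + bias -> 1 output

lr = 0.5
for epoch in range(10000):
    for x, t in zip(X, T):
        # steps 1-2: forward pass and error for this training sample
        x1 = np.append(x, 1.0)       # input plus bias
        h = sigmoid(x1 @ W1)         # hidden activations
        h1 = np.append(h, 1.0)       # hidden plus bias
        y = sigmoid(h1 @ W2)         # network output
        e = y - t                    # output error

        # steps 3-6: local errors (deltas), blame for the previous layer,
        # then weight updates that lower the local error
        delta_out = e * y * (1.0 - y)                       # sigmoid derivative
        delta_hid = (W2[:-1] @ delta_out) * h * (1.0 - h)   # back-propagated blame
        W2 -= lr * np.outer(h1, delta_out)
        W1 -= lr * np.outer(x1, delta_hid)

# Check the trained network on the four samples (usually close to 0, 1, 1, 0;
# as discussed later in the "Local optimum" section, training can occasionally
# end up in a poor local minimum).
Xb = np.c_[X, np.ones(len(X))]
H = np.c_[sigmoid(Xb @ W1), np.ones(len(X))]
print(sigmoid(H @ W2).round(2))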


    Other useful readings:

    http://en.wikipedia.org/wiki/Non-parametric_methods

    http://en.wikipedia.org/wiki/Expectation-Maximization

    When the problem is not that simple... Neural cooking

    Rules of thumb to use while building neural networks ( a little theory is needed )

    Local optimum

    Gradient methods may in some cases lead to an insufficient local optimum.

    Schematic definition of the different kinds of optima:

    Gradient search techniques can get stuck in a local minimum.

    That is why gradient techniques usually begin with a random initialization of the weights Wi. Each training run can therefore lead to a different NN solution; some solutions are better than others...

    Several solutions exist:


    http://en.wikipedia.org/wiki/Stochastic_gradient_descent

    http://en.wikipedia.org/wiki/Simulated_annealing

    http://en.wikipedia.org/wiki/Genetic_algorithms

    ...

    Over-fitting/Under-fitting

    Over-fitting

    In some cases the NN learns unwanted features.

    Here is an interesting story to introduce the over-fitting phenomenon.

    Clever Hans (From Wikipedia, the free encyclopedia)

    [Photo: Clever Hans performs]

    Clever Hans (in German, der Kluge Hans) was a horse that was claimed to have been able to perform arithmetic and other intellectual tasks.

    In 1907, psychologist Oskar Pfungst demonstrated that the horse's claimed abilities were due to an artifact in the research methodology, wherein the horse was responding directly to involuntary clues in the body language of the human trainer, who had the faculties to solve each problem. In honour of Pfungst's study, the anomalous artifact has since been referred to as the Clever Hans effect and has continued to be a recurrent problem with any research into animal cognition.



    Clever Hans and Pfungst's study

    The horse, Hans, had been trained by a Mr. von Osten to tap out the answers to arithmetic questions with its hoof. The answers to questions involving reading, spelling and musical tones were converted to numbers, and the horse also tapped out these numbers.

    Seeking to ascertain a scientific basis or disproof for the claim, philosopher and psychologist Carl Stumpf formed a panel of 13 prominent scientists, known as the Hans Commission, to study the claims that Clever Hans could count. The commission passed the evaluation on to Pfungst, who tested the basis for these claimed abilities by:

    1. Isolating horse and questioner from spectators, so no cues could come from them.

    2. Using questioners other than the horse's master.

    3. By means of blinders, varying whether the horse could see the questioner.

    4. Varying whether the questioner knew the answer to the question in advance.

    Using a substantial number of trials, Pfungst found that the horse could get the correct answer even if von Osten himself did not ask the questions, ruling out the possibility of fraud. However, the horse got the right answer only when the questioner knew what the answer was, and the horse could see the questioner. He then proceeded to examine the behaviour of the questioner in detail, and showed that as the horse's taps approached the right answer, the questioner's posture and facial expression changed in ways that were consistent with an increase in tension, which was released when the horse made the final, "correct" tap. This provided a cue that the horse could use to tell it to stop tapping.

    The social communication systems of horses probably depend on the detection of small postural changes, and this may be why Hans so easily picked up on the cues given by von Osten (who seems to have been entirely unaware that he was providing such cues). However, the capacity to detect such cues is not confined to horses. Pfungst proceeded to test the hypothesis that such cues would be discernible, by carrying out laboratory tests in which he played the part of the horse, and human participants sent him questions to which he gave numerical answers by tapping. He found that 90% of participants gave sufficient cues for him to get a correct answer.


    Clever Hans effect

    The risk of Clever Hans effects is one strong reason why comparative psychologists normally test animals in isolated apparatus, without interaction with them. However this creates problems of its own, because many of the most interesting phenomena in animal cognition are only likely to be demonstrated in a social context, and in order to train and demonstrate them, it is necessary to build up a social relationship between trainer and animal. This point of view has been strongly argued by Irene Pepperberg in relation to her studies of parrots, and by Alan and Beatrice Gardner in their study of the chimpanzee Washoe. If the results of such studies are to gain universal acceptance, it is necessary to find some way of testing the animals' achievements which eliminates the risk of Clever Hans effects. However, simply removing the trainer from the scene may not be an appropriate strategy, because where the social relationship between trainer and subject is strong, the removal of the trainer may produce emotional responses preventing the subject from performing. It is therefore necessary to devise procedures where none of those present knows what the animal's likely response may be.

    For an example of an experimental protocol designed to overcome the Clever Hans effect, see Rico (Border Collie).

    As Pfungst's final experiment makes clear, Clever Hans effects are quite as likely to occur in experiments with humans as in experiments with other animals. For this reason, care is often taken in fields such as perception, cognitive psychology, and social psychology to make experiments double-blind, meaning that neither the experimenter nor the subject knows what condition the subject is in, and thus what his or her responses are predicted to be. Another way in which Clever Hans effects are avoided is by replacing the experimenter with a computer, which can deliver standardized instructions and record responses without giving clues.

    Reference

    Pfungst, O. (1911). Clever Hans (The horse of Mr. Von Osten): A contribution to experimental animal and human psychology (Trans. C. L. Rahn). New York: Henry Holt. (Originally published in German, 1907).

    The horse learnt something which is, in a way, far more complex than basic arithmetic.

    A NN can do that too, when some unwanted statistical bias lies in the training data.

    Building a "Clever Hans" NN that does something different from what was expected, without knowing it, can be a very deceptive experience...

    Over-fitting can occur when the model (NN) has too many parameters relative to the size of the training dataset, but that is not the only case...


    Overfitting example on a sine function approximation:

    Solutions :

    Make sure to avoid any unwanted statistical bias.

    Try to use the smallest NN possible for a given problem (limiting the number of parameters of the model).

    Use a bigger training dataset.

    While training, use a different dataset (the test set) to detect when overfitting occurs: basically, when the mean error on the test set (not used to train the NN) starts increasing. This adds a difficulty to the art of splitting datasets. A good training procedure uses 3 different datasets:

    a training set: the one used to estimate the NN weights/parameters.

    a testing set: used to detect and avoid overfitting.

    a validation set: used to measure the actual performance of the NN.


    Studying the error curves on the training and test sets can be helpful to find the best moment to stop training.

    The error curves on the training and test sets are more or less the same at the beginning. At some point the training error continues to fall but the error on the test set does not, and may even start to grow.

    The divergence point of these two curves is the best moment to stop training, because it is when the NN generalises best. After that point the NN specializes too much on the training set and will be unable to produce correct output on new data.
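    A hedged sketch of that stopping rule in Python. The helpers train_one_epoch(), mean_error(), copy_weights() and set_weights() are hypothetical placeholders standing in for whatever NN library is actually used; only the stopping logic itself is the point here:

# Early-stopping sketch: keep training while the error on the test set
# (not used for weight updates) keeps improving, and roll back to the
# best weights once it has been rising for `patience` epochs in a row.

def train_with_early_stopping(net, train_set, test_set,
                              max_epochs=1000, patience=10):
    best_error = float("inf")
    best_weights = net.copy_weights()          # hypothetical helper
    bad_epochs = 0

    for epoch in range(max_epochs):
        train_one_epoch(net, train_set)        # hypothetical helper
        test_error = mean_error(net, test_set) # hypothetical helper

        if test_error < best_error:            # still generalising better
            best_error = test_error
            best_weights = net.copy_weights()
            bad_epochs = 0
        else:                                  # likely past the divergence point
            bad_epochs += 1
            if bad_epochs >= patience:
                break

    net.set_weights(best_weights)              # roll back to the best state
    return net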

    Under-fitting

    The other side of the fitting problem is under-fitting, when the model constraints are too strong relative to the statistics of the data.

    The example below shows the underfitted output of a sine function approximation.


    Solutions are:

    Raising the number of neurons (and adapting the topology; a single hidden layer is the most common topology but not always the best...)

    Make sure that your problem can be solved the way you presented it to the NN. Maybe the input data are not (that) relevant to the problem...

    Unsupervised learning

    Unsupervised algorithms are in a way simpler than supervised ones:

    There is no need to handle labels (the class information of each data entry)

    The mathematical background needed to understand and use them can be quite low.

    The problem is that they are quite counterintuitive and maybe need more abstraction capabilities.

    First of all, the question is: how can a machine learn something if no one tells it what is expected?

    Let's look at the K-means algorithm as an introduction...


    K-means algorithm (From Wikipedia, the free encyclopedia)

    The K-means algorithm is an algorithm to cluster objects based on attributes into k partitions. It is a variant of the expectation-maximization algorithm in which the goal is to determine the k means of data generated from Gaussian distributions. It assumes that the object attributes form a vector space. The objective it tries to achieve is to minimize total intra-cluster variance, i.e. the function

        V = Σ (i = 1..k) Σ (x in S_i) | x - μ_i |²

    where there are k clusters S_i, i = 1, 2, ..., k, and μ_i is the centroid or mean point of all the points x in S_i.

    The algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters, and the algorithm is repeated by alternate application of these two steps until convergence, which is obtained when the points no longer switch clusters (or alternatively when the centroids no longer change).

    The algorithm has remained extremely popular because it converges extremely quickly in practice. In fact, many have observed that the number of iterations is typically much less than the number of points. Recently, however, Arthur and Vassilvitskii showed that there exist certain point sets on which k-means takes superpolynomial time, 2^Ω(√n), to converge.

    In terms of performance the algorithm is not guaranteed to return a global optimum. The quality of the final solution depends largely on the initial set of clusters, and may, in practice, be much poorer than the global optimum. Since the algorithm is extremely fast, a common method is to run the algorithm several times and return the best clustering found.

    Another main drawback of the algorithm is that it has to be told the number of clusters (i.e. k) to find. If the data is not naturally clustered, you get some strange results. Also, the algorithm works well only when spherical clusters are naturally available in data.

    References

    J. B. MacQueen (1967): "Some Methods for classification and Analysis of Multivariate Observations", Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297

    D. Arthur, S. Vassilvitskii (2006): "How Slow is the k-means Method?", Proceedings of the 2006 Symposium on Computational Geometry (SoCG).


    A very nice applet showing the k-means algorithm running:

    http://www.leet.it/home/lale/joomla/component/option,com_wrapper/Itemid,50/

    K-means algorithm :

    1. Choose the supposed number of clusters k.

    2. Set randomly the positions of the k centroids/means.

    3. Construct a new partition by associating each input point with the closest centroid/mean.

    4. Update the centroid/mean coordinates (the mean of all data points belonging to the cluster).

    5. Did the partition change? Yes -> go to 3. No -> end of the algorithm.

    Since this algorithm minimizes intra-cluster variance, we can expect, for an easy k-class classification problem, that our k centroids end up at the centers of the classes. Then we just need to label a few data points in each class to test this hypothesis and give a class label to each cluster.

    This is an idealistic case, but it helps to understand how unsupervised training can be done. Remember that we did not use any label information while training. Only a very few labels (not enough for supervised training) are used, and only to make the class/cluster matching (ideally we only need k labels, one for each class).
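    A minimal sketch of steps 1 to 5 above in Python/numpy (the data generation and the value of k are illustrative assumptions):

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: random initial centroids, here chosen among the input points
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # step 3: associate each point with the closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # step 5: stop when the partition no longer changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 4: move each centroid to the mean of its cluster
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = points[labels == i].mean(axis=0)
    return centroids, labels

# Illustrative usage: two Gaussian blobs in 2D, with k = 2 (step 1)
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
                  rng.normal(3.0, 0.5, (50, 2))])
centers, assignment = kmeans(data, k=2)
print(centers)   # roughly one centroid near (0, 0) and one near (3, 3)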

    Kohonen network

    Kohonen networks, also known as Self Organizing Maps (SOM), are based on the k-means algorithm. The additional feature is that each centroid (mean) is located on a map in such a way that topology is conserved. This means that centroids that are close in the problem space should also be close on the map.

    One can choose any type of map (in dimension and topology); this is a way to represent data of any dimension (the problem space) on a chosen map (ideally 2D, which is convenient for humans)...

    SOM helps to represent high-dimensional data in a low-dimensional space while preserving topological information.

    Example: a 2D Kohonen network map holding 3D input neurons. When trained,


    each neuron will hold a 3D representative vector (a centroid, as in the k-means algorithm); the position of each neuron on the map should respect the input topology (which is 3D in this case). So this map should be able to represent 3D information on a 2D map, if correctly trained.

    One important feature of SOM is that neurons are set on a map where location is important.

    In SOM, the neighborhood is defined as a mathematical function whose parameters are one neuron of the map and a neighborhood level; the output is the list of neurons belonging to this neighborhood. (The higher the neighborhood level, the higher the number of neurons in the neighborhood.)
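    A small sketch of such a neighborhood function for a square 2D map, in Python (using the Chebyshev, or chessboard, distance is one possible assumption among others):

# Neighborhood sketch on a rows x cols square map: return the coordinates of
# every neuron within `level` steps of neuron (r, c), in Chebyshev distance.

def neighborhood(r, c, level, rows, cols):
    neurons = []
    for i in range(max(0, r - level), min(rows, r + level + 1)):
        for j in range(max(0, c - level), min(cols, c + level + 1)):
            neurons.append((i, j))
    return neurons

print(len(neighborhood(5, 5, 0, 10, 10)))  # 1: just the neuron itself
print(len(neighborhood(5, 5, 1, 10, 10)))  # 9: the neuron and its 8 neighbours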

    Neighborhood example on a square 2D map :


    All the input data are presented to the neurons; the closest neuron and its neighborhood are selected to be updated toward the input data sample. The neighborhood is slowly decreased along the iterations...

    Note: a lot of different topologies are possible. This choice is often the core of the problem, as it defines the properties of the map...

    SOM training algorithm :

    1. Initialise the neuron map.

    2. Loop over a decreasing neighborhood schedule.

    3. Loop over all input data.

    4. Search for the closest neuron (n) to the current input sample (S).

    5. Move the closest neuron and its neighborhood even closer to the input data sample: W(t+1) = W(t) + a * (S - W(t)), where a is a learning factor and S is the current sample.

    6. End of the loop over input data.

    7. End of the neighborhood-decreasing loop.


    The training phase can be viewed as a deformation of the neural map toward the shape of the input data. A large neighborhood is like a quite rigid map, a small neighborhood is like a soft map. So the map goes from a rigid to a soft state to gradually fit the input data.
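    A minimal sketch of steps 1 to 7 above in Python/numpy, using the update W(t+1) = W(t) + a * (S - W(t)) and a square neighborhood that shrinks over the epochs. The map size, learning factor and schedule are illustrative assumptions:

import numpy as np

def train_som(data, rows=10, cols=10, epochs=20, a=0.2, seed=0):
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    # step 1: initialise the neuron map with random weight vectors
    W = rng.uniform(data.min(), data.max(), size=(rows, cols, dim))
    # grid coordinates of every neuron, used to measure map distances
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

    # step 2: loop over a decreasing neighborhood schedule
    for epoch in range(epochs):
        level = max(1, int(max(rows, cols) * (1 - epoch / epochs)))
        # step 3: loop over all input data (in random order)
        for S in data[rng.permutation(len(data))]:
            # step 4: search the closest neuron (best matching unit) for S
            dists = np.linalg.norm(W - S, axis=2)
            r, c = np.unravel_index(dists.argmin(), dists.shape)
            # step 5: move the winner and its neighborhood toward S
            in_hood = np.abs(grid - np.array([r, c])).max(axis=2) <= level
            W[in_hood] += a * (S - W[in_hood])
    return W

# Illustrative usage: uniform 2D data mapped onto a 10x10 grid
data = np.random.default_rng(1).uniform(0, 1, size=(1000, 2))
som = train_som(data)
print(som.shape)   # (10, 10, 2): each neuron holds a 2D weight vector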

    Here is an example of fitting 2D uniform random data (the input space) onto a 2D SOM:

    How to read the map: the 2D positions on the figure come from the input data space; the topology information is in the grid connections.

    The resulting straight grid indicates that we are mapping a 2D square onto a 2D square (preserving topology)...

    The following example is more interesting... The input data is still 2D here but the topology is different. The input data is displayed in a 2D cross shape (which is a different topology from the square), while the Kohonen map is still a 2D square.


    Kohonen maps are inspired by biological localisation neurons. You will see in this example how a SOM can help to find one's way in the data.

    Imagine that you need to operate a robot that has to move inside the input data as in a maze (inside the cross shape). If you have to go from A to B, the simple algorithm is to compute the A->B vector direction and use it to move the robot. This will lead the robot out of the cross (or into the wall of the maze).



    Using the Kohonen map can help us to find the best way; this is how to use it:

    Find the neurons a & b corresponding to the points A & B.

    Compute the a->b vector in the map (using neuron map coordinates).

    Follow the map vector from neuron a to neuron b.

    Each neuron on the road holds a point/vector (in the input space) showing, step by step, the way from A to B in the input space.

    When applying this algorithm to the cross-shape figure, you will see that your robot carefully avoids the walls.
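    A sketch of that lookup in Python/numpy, assuming a trained map W of shape (rows, cols, dim) like the one produced by the SOM sketch above; walking in a straight line in map coordinates is a simplifying assumption:

import numpy as np

def closest_neuron(W, point):
    # map coordinates (r, c) of the neuron whose weight vector is
    # closest to `point` in the input space
    d = np.linalg.norm(W - point, axis=2)
    return np.unravel_index(d.argmin(), d.shape)

def path_in_input_space(W, A, B, steps=20):
    a = np.array(closest_neuron(W, A), dtype=float)
    b = np.array(closest_neuron(W, B), dtype=float)
    waypoints = []
    for t in np.linspace(0.0, 1.0, steps):
        r, c = np.round(a + t * (b - a)).astype(int)   # walk a -> b on the map
        waypoints.append(W[r, c])   # input-space point held by the visited neuron
    return np.array(waypoints)

    Because trained neurons only sit where input data actually exists, the returned waypoints stay inside the cross shape instead of cutting across it.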

    Note: the convergence of the map must be handled carefully... The cactus example below would need some tuning of the training parameters...



    SOM can be viewed as a minimum-deformation mapping from one topology to another.

    Remark: the following figure shows that mapping an N-dimensional set of data onto a 1-dimensional map can lead to a solution to the traveling salesman problem. Note that a circular topology would suit this better than a linear one.

    A few web demos :

    2D SOM http://www-ti.informatik.uni-tuebingen.de/~goeppert/KohonenApp/KohonenApp.html

    an other 2D SOM http://www.cs.utexas.edu/users/yschoe/java/javasom/Base.html

    color SOM http://davis.wpi.edu/~matt/courses/soms/applet.html

    traveling salesman problem http://www.ice.nuie.nagoya-u.ac.jp/~l94334/bio/tsp/tsp.html

    OCR http://www.ice.nuie.nagoya-u.ac.jp/~l94334/bio/tsp/tsp.html

    WEBSOM http://websom.hut.fi/websom/milliondemo/html/root.html

    topology preservation demo http://www.cis.hut.fi/research/javasomdemo/demo2.html


    The wide field of neural networks

    This course and document only cover the most basic and popular neural networks. There is a very wide variety of neural networks solving very different problems. Here is a non-exhaustive list of other kinds of neural computing techniques.

    Reinforcement learning: an intermediate way between supervised and unsupervised learning. The network is not explicitly given the expected output but rather a reward or punishment according to its output.

    Recurrent networks (better suited for some forecasting problems), where the output at time t is fed back as part of the NN input at time (t+1), in order to model the influence of time on the model.

    Time-delayed NN are used to handle time/space independence (detecting an event no matter where or when it appears).

    Radial Basis Functions: the sigmoid function is not always the best choice to model some kinds of problems; RBF can be an alternative.

    Hybrid methods (supervised and unsupervised): a first-stage Kohonen layer can lower the problem dimension.

    Advantages :

    Lower input dimension (2D: the coordinates of the activated Kohonen neurons), hence a lower number of model parameters.

    Handling time series in an N-dimensional input space: accumulate the activated SOM neurons with a time decay and feed the map into a feed-forward NN.

    For each type of NN there are several training procedures, which impact its characteristics.

    ...

    This is just a small list of known techniques; derivatives, new ones and combinations are frequently invented...

    A few web demos :

    reinforcement learning : Robot arm

    http://www.fe.dis.titech.ac.jp/~gen/robot/robodemo.html

    http://iridia.ulb.ac.be/~fvandenb/qlearning/qlearning.html

    cat & mouse

    http://www.cse.unsw.edu.au/~cs9417ml/RL1/applet.html

    Hamming associative memory

    http://neuron.eng.wayne.edu/Hamming/voting.html

    Hopfield network

    http://suhep.phy.syr.edu/courses/modules/MM/sim/hopfield.html


    A word on algorithmic complexity

    Training is CPU-consuming, but...

    The training phase is expensive whatever neural technique is used. For most applications training is done only once, so it is a transient problem.

    Using a trained neural network is quite straightforward and generally does not require high CPU usage.

    So with a usual computer you can address a lot of common problems.

    But ...

    The algorithmic complexity of training algorithms can easily be above O(n*N*i), where:

    n is the number of data samples (the training data size is usually as big as possible),

    N is the number of neurons (when modeling a big set of complex data, it is quite usual to use a big network),

    i is the number of iteration steps (from hundreds to thousands); if the n and N parameters are high, it is quite frequent that i needs to be high too.

    So the processing power can be an issue if some of the n, N or i parameters are really high. It becomes a serious issue if real-time answers are requested from such a process.
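    As a rough back-of-the-envelope illustration (all numbers are made up for the sake of the example):

# Cost estimate for O(n*N*i) with illustrative values
n = 100_000    # data samples
N = 1_000      # neurons
i = 1_000      # training iterations
print(n * N * i)   # 100_000_000_000, i.e. on the order of 1e11 elementary updates

    Numbers of that magnitude are what motivate the divide-and-conquer and specialized-hardware solutions described next.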

    Solutions

    divide and conquer strategy :

    Since building and training NN requires a lot of trials, you can easily split these trials across different computers on a standard network.

    A grid computer or a cluster, with appropriately parallelised NN software, can efficiently split the training across the processing units.

    NN are not well suited to classical CPUs, or even to math processors. They only perform very simple operations and are parallel by nature (remember the biological inspiration).

    Today's processors are not suited to this kind of processing. Specialized hardware exists to fit this processing need.

    Specialized hardware can outperform clusters or grid computers because of the parallel nature of NN.

    Specialized hardware is also less expensive and smaller than clusters, which often counts on real-life problems.


    NN are not magic!

    NN are not magic: if the information you are trying to discover is not in the data, the NN won't find it. You should have evidence or a strong intuition that the information is in your data. Tools for this can be:

    2D marginal projections.

    expert information.

    We have seen that neural networks can solve several kinds of problems:

    classification

    forecasting

    shortest way or traveling salesman problem

    robotics/automation

    ...

    but NN are not the best choice in all cases. Well-known alternatives are:

    Markov chains

    Regression models (better when there is a priori knowledge about the function, such as physical or statistical laws).

    K-nearest neighbor

    Adaptive filtering

    ...

    Some of these techniques can be seen as variations of NN, but they were discovered before the NN analogy.

    Solving a problem with NN needs:


    [Illustration 2: Zero Instruction Set Chip]

    [Illustration 1: mix of real (wet) neurons and an electronic chip]

    training/testing data

    Neural network :

    a topology (number and disposition of neurons in layers)

    Topology might be an issue; there are very few basic rules and little knowledge leading to the right solution every time. The usual way is to try a lot of different topologies and keep the best one. The only basic rules are:

    the more complex the problem, the more neurons are required;

    the more neurons, the more data you need to train them.

    a training procedure

    Training can be viewed as an optimisation problem and, as in any problem of this kind, local optima can be an issue.

    NN are suited to solving continuous problems: close input patterns should correspond to close output patterns. Not all real-life problems are of that kind.

    Multimedia data (sound, images, moving pictures) generally need a first-stage data processing (a Fourier/wavelet transform is usual).

    Important warning : NN are F@#& magic in the Murphy's law context

    If some undetected bug adds noise to the training set, the NN can get through it with more or less success. If no one detects the bug, you will get poor results and believe that the problem is harder than expected (which might not be the case...).

    How to actually create and use neural networks

    There exist today a lot of ways to build a NN:

    Libraries

    proprietary libraries : easy but expensive

    free software (open source) libraries: easy and free.

    Under the LGPL licence or similar, which you can use in proprietary software!

    It might need a little coding, but that's an engineer's job, no? ;-)

    Modelers with a graphical interface

    proprietary: nice and expensive.

    free software (open source): nice and free (installation might sometimes need some patience).

    Programming it yourself from the bottom up (needs good engineering skills).


    When NN do better than experts

    On some problems a NN can do better than a human expert.

    Why ?

    The number of parameters is too high to be processed by a human being.

    Human experts might be misled by false a priori knowledge.

    The problem can be too easy and repetitive to be done properly by a human.

    The problem and parameters might be counterintuitive.

    example : http://ai.bpa.arizona.edu/papers/dog93/dog93.html

    This ability leads to some human problems:

    fear of losing one's job: NN can raise the old fear of being replaced by a machine.

    expert resistance: for example, the belief that a computer science engineer can't solve a biological problem better/quicker than a PhD expert in biology, physics or finance...

    Because of the above phenomena, finding NN problems to solve is a hard task; you can't really expect experts to bring you NN projects.

    Deeper inside: Exercises

    You now have all the information you need to build a NN yourself. All you need is to walk along the shore (remember the fractal approach); the navigation points are the paragraphs in the "deeper inside" sections of this document. Follow these guidelines to help you build your datasets, train the network and test it.

    The remaining resources you need are a NN programming library and datasets. NN libraries can be found in the software section of the bibliography. Datasets are harder to find, as explained in the "Datasets cost" section. Some public datasets are listed below.

    Datasets

    web : http://kdd.ics.uci.edu/

    protein localisation ftp://ftp.ics.uci.edu/pub/machine-learning-databases/ecoli/

    spam detection ftp://ftp.ics.uci.edu/pub/machine-learning-databases/spambase/

    satellite images: http://kdd.ics.uci.edu/databases/covertype/

    ad detection: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/internet_ads/


    promoters: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/molecular-biology/

    optic digits : ftp://ftp.ics.uci.edu/pub/machine-learning-databases/optdigits/

    KDD98 cup (mailing response) : http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html

    KDD99 cup (internet intrusion detection) : http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

    EEG :(alcoholic|non-a.) http://kdd.ics.uci.edu/databases/eeg/eeg.html

    collection of machine learning dataset proben1.tar.gz ftp://ftp.ira.uka.de/pub/neuron

    ...

    Bibliography

    Format is: title, authors, publisher. Comment.

    Réseaux de neurones, méthodologies et applications, edited by Gérard Dreyfus, Eyrolles. Great book.

    Pattern Classification, Duda, Hart, Stork, Wiley-Interscience. Good book on pattern classification (NN is only one chapter).

    Neural Networks: Algorithms, Applications, and Programming Techniques, James A. Freeman, David M. Skapura, Addison-Wesley. Quite an old book with a lot of illustrations and even code samples (in Pascal?). Can be a good complement to this course.

    Cybernétique des réseaux neuronaux, Alain Faure, Hermès. Real neuron biology described by engineers; very little about artificial neurons.

    Les Réseaux neuromimétiques, Jean-François Jodouin, Hermès. Clear explanations, nice illustrations, even code samples; it's a good book.


    Web sites

    http://www.wikipedia.org/ Wikipedia (many paragraphs of this document come directly from Wikipedia)

    http://leenissen.dk/fann/

    http://www.google.com/Top/Computers/Artificial_Intelligence/Neural_Networks/Companies/

    http://www.google.com/Top/Computers/Artificial_Intelligence/Conferences_and_Events/

    ...

    Softwares

    Open source

    FANN http://leenissen.dk/fann/

    SNNS http://www-ra.informatik.uni-tuebingen.de/SNNS/

    scilab http://www.scilab.org

    http://www.scilab.org/contrib/displayContribution.php?fileID=166

    ...

    Proprietary

    Mathematica http://www.wolfram.com/products/applications/neuralnetworks/

    matlab http://www.mathworks.com/products/neuralnet/

    ...

