# Anthony Kuh - Neural Networks and Learning Theory

Post on 30-Sep-2014


Neural Networks and Learning Theory
Spring 2005

Prof. Anthony Kuh
POST 205E, Dept. of Elec. Eng., University of Hawaii
Phone: (808)-956-7527, Fax: (808)-956-3427
Email: kuh@spectra.eng.hawaii.edu

EE645

Preliminaries

- Class Meeting Time: MWF 8:30-9:20
- Office Hours: MWF 10-11 (or by appointment)
- Prerequisites:
  - Probability: EE342 or equivalent
  - Linear Algebra:
  - Programming: Matlab or C experience

I. Introduction to neural networks

Goal: study computational capabilities of neural network and learning systems.

- Multidisciplinary field
- Algorithms, Analysis, Applications

A. Motivation

Why study neural networks and machine learning?

- Biological inspiration (natural computation)
- Nonparametric models: adaptive learning systems, learning from examples, analysis of learning models
- Implementation
- Applications

Cognitive (human vs. computer intelligence):

- Humans are superior to computers in pattern recognition, associative recall, and learning complex tasks.
- Computers are superior to humans in arithmetic computations and simple repeatable tasks.

Biological (study of the human brain):

- 10^10 to 10^11 neurons in the cerebral cortex, with an average of 10^3 interconnections per neuron.

[Figure: a neuron]

[Figure: schematic of one neuron]

Neural Network

Connection of many neurons together forms a neural network.

Neural network properties:

- Highly parallel (distributed computing)
- Robust and fault tolerant
- Flexible (short and long term learning)
- Handles variety of information (often random, fuzzy, and inconsistent)
- Small, compact, dissipates very little power

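B. Single Neuron (Computational node)

[Figure: single neuron with inputs x, weights w, threshold weight w0, weighted sum s, activation function g( ), and output y]

$s = w^T x + w_0$: synaptic strength (linearly weighted sum of inputs).

$y = g(s)$: activation or squashing function.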

Activation functions

- Linear units: $g(s) = s$.
- Linear threshold units: $g(s) = \mathrm{sgn}(s)$.
- Sigmoidal units: $g(s) = \tanh(Bs)$, $B > 0$.

Neural networks generally have nonlinear activation functions.

Most popular models: linear threshold units and sigmoidal units. Other types of computational units: receptive units (radial basis functions).
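A minimal Python sketch of the single computational node and the three activation functions listed above. The function names (`linear`, `threshold`, `sigmoid`, `neuron`), the default `B = 1`, and the sample numbers are illustrative assumptions, not part of the course notes.

```python
import math

def linear(s):
    """Linear unit: g(s) = s."""
    return s

def threshold(s):
    """Linear threshold unit: g(s) = sgn(s), with sgn(0) taken as +1."""
    return 1.0 if s >= 0 else -1.0

def sigmoid(s, B=1.0):
    """Sigmoidal unit: g(s) = tanh(B*s), B > 0."""
    return math.tanh(B * s)

def neuron(x, w, w0, g):
    """Compute y = g(w^T x + w0) for one computational node."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return g(s)

# Hypothetical inputs and weights: s = 0.5 - 0.5 + 0.1 = 0.1
x = [1.0, -2.0]
w = [0.5, 0.25]
w0 = 0.1
y_lin = neuron(x, w, w0, linear)
y_thr = neuron(x, w, w0, threshold)
y_sig = neuron(x, w, w0, sigmoid)
```

The same weighted sum feeds all three units; only the squashing function `g` changes.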

C. Neural Network Architectures

Systems composed of interconnected neurons.

[Figure: network of interconnected neurons mapping inputs to output]

Neural network represented by directed graph: edges represent weights and nodes represent computational units.

Definitions

- A feedforward neural network has no loops in its directed graph.
- Neural networks are often arranged in layers.
- A single layer feedforward neural network has one layer of computational nodes.
- A multilayer feedforward neural network has two or more layers of computational nodes.
- Computational nodes that are not output nodes are called hidden units.

D. Learning and Information Storage

1. Neural networks have computational capabilities.
   - Where is information stored in a neural network?
   - What are the parameters of a neural network?
2. How does a neural network work? (two phases)
   - Training or learning phase (equivalent to the write phase in conventional computer memory): weights are adjusted to meet a certain desired criterion.
   - Recall or test phase (equivalent to the read phase in conventional computer memory): weights are fixed as the neural network realizes some task.

3. What can neural network models learn?
   - Boolean functions
   - Pattern recognition problems
   - Function approximation
   - Dynamical systems
4. What type of learning algorithms are there?
   - Supervised learning (learning with a teacher)
   - Unsupervised learning (no teacher)
   - Reinforcement learning (learning with a critic)

5. How do neural networks learn?
   - Iterative algorithm: weights of the neural network are adjusted online as training data is received.
   - $w(k+1) = L(w(k), x(k), d(k))$ for supervised learning, where $d(k)$ is the desired output.
   - Need a cost criterion. A common cost criterion is the mean squared error; for one output, $J(w) = (y(k) - d(k))^2$.
   - Goal is to find the minimum of $J(w)$ over all possible $w$.
   - Iterative techniques often use gradient descent approaches.

6. Learning and Generalization
   - A learning algorithm takes training examples as inputs and produces the concept, pattern, or function to be learned.
   - How good is a learning algorithm? Generalization ability measures how well a learning algorithm performs.
   - Sufficient number of training examples (LLN, typical sequences).
   - Occam's razor: the simplest explanation is the best.

[Figure: regression problem, with data points marked +]

Generalization error: $\epsilon_g = \epsilon_{emp} + \epsilon_{model}$

- Empirical error: average error from training data (desired output vs. actual output).
- Model error: due to the dimensionality of the class of functions or patterns.
- Desire the class to be large enough so that the empirical error is small and small enough so that the model error is small.

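II. Linear threshold units

A. Preliminaries

[Figure: linear threshold unit with inputs x, weights w, threshold weight w0, weighted sum s, activation sgn( ), and output y]

$\mathrm{sgn}(s) = 1$ if $s \ge 0$; $-1$ if $s < 0$.

[...] $> 0$ go to 5)

$w(k+1) = w(k) + x(k)d(k)$

$k = k+1$; check if cycled through data; if not, go to 2. Otherwise stop.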

PLA comments

- Perceptron convergence theorem (requires margins)
- Sketch of proof
- Updating threshold weights
- Algorithm is based on the cost function $J(w) = -(\text{sum of synaptic strengths of misclassified points})$
- $w(k+1) = w(k) - \mu(k)\nabla J(w(k))$ (gradient descent)
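The update rule $w(k+1) = w(k) + x(k)d(k)$ on misclassified points can be sketched in Python as below. This is one plausible reading under stated assumptions: zero initial weights, the threshold weight handled as an extra weight on a constant input of 1, and repeated passes until a full pass produces no update. The function name `perceptron_train`, the `max_epochs` cap, and the AND data set are hypothetical.

```python
def perceptron_train(X, d, max_epochs=100):
    """Train a linear threshold unit with the perceptron rule.

    X: list of input vectors; d: list of desired outputs in {-1, +1}.
    The threshold weight w0 is handled by appending a constant input of 1.
    Returns (weights, number_of_updates).
    """
    Xa = [list(x) + [1.0] for x in X]       # augmented inputs
    w = [0.0] * len(Xa[0])                  # assumed start: w(0) = 0
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for x, dk in zip(Xa, d):
            s = sum(wi * xi for wi, xi in zip(w, x))
            if dk * s <= 0:                 # misclassified (or on the boundary)
                w = [wi + dk * xi for wi, xi in zip(w, x)]   # w <- w + x d
                updates += 1
                changed = True
        if not changed:                     # a full pass with no mistakes
            break
    return w, updates

# Logical AND with +/-1 coding is linearly separable.
X = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
d = [-1, -1, -1, 1]
w, updates = perceptron_train(X, d)
outputs = [1 if sum(wi * xi for wi, xi in zip(w, x + [1.0])) >= 0 else -1
           for x in X]
```

Treating the boundary case `dk * s == 0` as a mistake is a design choice that lets the algorithm leave the all-zero starting point.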

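Perceptron Convergence Theorem

Assumptions: $w^*$ is a solution with $\|w^*\| = 1$, no threshold, and $w(0) = 0$. Let $\max_k \|x(k)\| = \beta$ and $\min_k d(k)\,x(k)^T w^* = \gamma$. Here $k$ counts weight updates.

$\langle w(k), w^* \rangle = \langle w(k-1), w^* \rangle + d(k-1)\,x(k-1)^T w^* \ge k\gamma$.

$\|w(k)\|^2 \le \|w(k-1)\|^2 + \|x(k-1)\|^2 \le \|w(k-1)\|^2 + \beta^2 \le k\beta^2$.

Since $\langle w(k), w^* \rangle \le \|w(k)\|\,\|w^*\| = \|w(k)\|$, the two bounds give $k\gamma \le \sqrt{k}\,\beta$, which implies $k \le (\beta/\gamma)^2$ (max number of updates).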

III. Linear Units

A. Preliminaries

[Figure: linear unit with inputs x, weights w, and output s = y]

Model Assumptions and Parameters

Training examples $(x(k), d(k))$ drawn randomly.

Parameters:

- Inputs: $x(k)$
- Outputs: $y(k)$
- Desired outputs: $d(k)$
- Weights: $w(k)$
- Error: $e(k) = d(k) - y(k)$

Error criterion (MSE)

$\min_w J(w) = E[\tfrac{1}{2} e(k)^2]$

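Wiener solution

Define $P = E[x(k)d(k)]$ and $R = E[x(k)x(k)^T]$.

$J(w) = \tfrac{1}{2} E[(d(k)-y(k))^2] = \tfrac{1}{2} E[d(k)^2] - E[x(k)d(k)]^T w + \tfrac{1}{2} w^T E[x(k)x(k)^T] w = \tfrac{1}{2} E[d(k)^2] - P^T w + \tfrac{1}{2} w^T R w$

Note that $J(w)$ is a quadratic function of $w$. To minimize $J(w)$, find the gradient $\nabla J(w)$ and set it to 0:

$\nabla J(w) = -P + Rw = 0 \Rightarrow Rw = P$ (Wiener solution)

If $R$ is nonsingular, then $w = R^{-1}P$. Resulting MSE $= \tfrac{1}{2} E[d(k)^2] - \tfrac{1}{2} P^T R^{-1} P$.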
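A small worked sketch of the Wiener solution in Python, assuming two inputs and replacing the expectations in $R$ and $P$ with sample averages. The target weights `w_true`, the noise-free data, and the use of Cramer's rule for the 2x2 system are illustrative assumptions. With $J(w) = \tfrac{1}{2}E[e^2]$, substituting $w = R^{-1}P$ gives a minimum of $\tfrac{1}{2}E[d^2] - \tfrac{1}{2}P^T w$, which the last lines evaluate.

```python
import random

random.seed(0)
w_true = [2.0, -1.0]                         # hypothetical target weights
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]
D = [w_true[0] * x[0] + w_true[1] * x[1] for x in X]   # noise-free d(k)

n = len(X)
# Sample estimates of R = E[x x^T] and P = E[x d].
R = [[sum(x[i] * x[j] for x in X) / n for j in range(2)] for i in range(2)]
P = [sum(x[i] * dk for x, dk in zip(X, D)) / n for i in range(2)]

# Solve the 2x2 system R w = P by Cramer's rule (R assumed nonsingular).
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
w = [(P[0] * R[1][1] - R[0][1] * P[1]) / det,
     (R[0][0] * P[1] - R[1][0] * P[0]) / det]

# Minimum of J(w) = 0.5 E[e^2] at the solution: 0.5 E[d^2] - 0.5 P^T w
Ed2 = sum(dk * dk for dk in D) / n
J_min = 0.5 * Ed2 - 0.5 * (P[0] * w[0] + P[1] * w[1])
```

Because the data here are exactly linear in `x`, the solved weights recover `w_true` and the minimum error is zero up to rounding; with noisy targets `J_min` would be positive.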

Iterative algorithms

Steepest descent algorithm (move in the direction of the negative gradient):

$w(k+1) = w(k) - \mu \nabla J(w(k)) = w(k) + \mu (P - Rw(k))$

Least mean square (LMS) algorithm (approximate the gradient from a training example):

$\hat{\nabla} J(w(k)) = -e(k)x(k)$

$w(k+1) = w(k) + \mu e(k)x(k)$
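A minimal Python sketch of the LMS update, where each training example supplies the gradient estimate $-e(k)x(k)$. The step size name `mu`, its value 0.05, the function name `lms`, and the synthetic noise-free data are assumptions made for illustration.

```python
import random

def lms(X, D, mu, w0=None):
    """Run one pass of LMS: w(k+1) = w(k) + mu * e(k) * x(k)."""
    w = list(w0) if w0 is not None else [0.0] * len(X[0])
    for x, dk in zip(X, D):
        y = sum(wi * xi for wi, xi in zip(w, x))     # y(k) = w(k)^T x(k)
        e = dk - y                                   # e(k) = d(k) - y(k)
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]
    return w

random.seed(1)
w_true = [2.0, -1.0]                         # hypothetical target weights
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(2000)]
D = [w_true[0] * x[0] + w_true[1] * x[1] for x in X]
w = lms(X, D, mu=0.05)
```

Unlike steepest descent, this loop never forms $R$ or $P$; it only needs the current example, which is why it can run online.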

Steepest Descent Convergence

$w(k+1) = w(k) + \mu(P - Rw(k))$. Let $w^*$ be the solution, so $Rw^* = P$.

Center the weight vector: $v = w - w^*$, which gives $v(k+1) = v(k) - \mu R v(k)$.

Assume $R$ is nonsingular. Decorrelate the weight vector: $u = Q^{-1}v$, where $R = Q \Lambda Q^{-1}$ is the transformation that diagonalizes $R$.

$u(k+1) = (I - \mu\Lambda)u(k)$, so $u(k) = (I - \mu\Lambda)^k u(0)$.

Condition for convergence: $0 < \mu < 2/\lambda_{\max}$.
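The step size condition can be checked numerically with a small Python sketch. It assumes a diagonal $R$ (so the eigenvalues are the diagonal entries) and writes the step size as `mu`; the matrix, the solution `w_star`, and the two step sizes are hypothetical values chosen to sit on either side of the bound $2/\lambda_{\max}$.

```python
def steepest_descent(R, P, mu, steps, w0):
    """Iterate w(k+1) = w(k) + mu * (P - R w(k))."""
    w = list(w0)
    n = len(w)
    for _ in range(steps):
        Rw = [sum(R[i][j] * w[j] for j in range(n)) for i in range(n)]
        w = [w[i] + mu * (P[i] - Rw[i]) for i in range(n)]
    return w

# Diagonal R, so its eigenvalues are the diagonal entries: 0.5 and 2.0.
R = [[0.5, 0.0], [0.0, 2.0]]
w_star = [1.0, -1.0]
P = [0.5, -2.0]              # P = R w*
lam_max = 2.0                # step size bound: 0 < mu < 2 / lam_max = 1

w_ok = steepest_descent(R, P, mu=0.9, steps=200, w0=[0.0, 0.0])
w_bad = steepest_descent(R, P, mu=1.1, steps=200, w0=[0.0, 0.0])

err_ok = max(abs(a - b) for a, b in zip(w_ok, w_star))
err_bad = max(abs(a - b) for a, b in zip(w_bad, w_star))
```

With `mu = 0.9` both modes shrink by factors 0.55 and -0.8 per step; with `mu = 1.1` the mode tied to the largest eigenvalue is multiplied by -1.2 each step and grows without bound.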

LMS Algorithm Properties

- Steepest descent and LMS algorithm convergence depends on the step size and the eigenvalues of $R$.
- LMS algorithm is simple to implement.
- LMS algorithm convergence is relatively slow.
- Tradeoff between convergence speed and excess MSE.
- LMS algorithm can track training data that is time varying.

Adaptive MMSE Methods

Training data

- Linear MMSE: LMS, RLS algorithms
- Nonlinear decision feedback detectors

Blind algorithms

- Second order statistics
  - Minimum Output Energy Methods
  - Reduced order approximations: PCA, multistage Wiener Filter
- Higher order statistics
  - Cumulants, information based criteria

Designing a learning system

Given a set of training data, design a system that can realize the desired task.

[Figure: block diagram: Inputs -> Signal Processing -> Feature Extraction -> Neural Network -> Outputs]

IV. Multilayer Networks

A. Capabilities

- Depend directly on the total number of weights and threshold values.
- A one-hidden-layer network with a sufficient number of hidden units can approximate, to arbitrary accuracy, any Boolean function, pattern recognition problem, or well-behaved function approximation problem.
- Sigmoidal units are more powerful than linear threshold units.

B. Error backpropagation

Error backpropagation algorithm: methodical way of implementing the LMS algorithm for multilayer neural networks.

- Two passes: forward pass (computational pass) and backward pass (weight correction pass).
- Analog computations based on the MSE criterion.
- Hidden units are usually sigmoidal units.
- Initialization: weights take on small random values.
- Algorithm may not converge to the global minimum.
- Algorithm converges slower than for linear networks.
- Representation is distributed.

BP Algorithm Comments

- The $\delta$s are error terms computed from the output layer back to the first layer in the dual network.
- Training is usually done online.
- Examples are presented in random or sequential order.
- Update rule is local, as weight changes only involve connections to the weight.
- Computational complexity depends on the number of computational units.
- Initial weights are randomized to avoid converging to local minima.
- Threshold weights are updated in a similar manner to other weights (input = 1).
- Momentum term added to speed up convergence.
- Step size set to a small value.
- Sigmoidal activation derivatives are simple to compute.
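The two passes can be sketched in Python for a network with one hidden layer. This is an illustrative sketch under assumptions, not the course's implementation: tanh hidden units and a tanh output unit, the cost $\tfrac{1}{2}e^2$ per example, online updates in sequential order, no momentum term, and hypothetical names and settings (`forward`, `mse`, `train`, 4 hidden units, step size 0.1, XOR training data).

```python
import math
import random

def forward(x, W1, b1, W2, b2):
    """Forward pass: tanh hidden layer, tanh output unit."""
    h = [math.tanh(sum(wij * xj for wij, xj in zip(wi, x)) + bi)
         for wi, bi in zip(W1, b1)]
    y = math.tanh(sum(w * hj for w, hj in zip(W2, h)) + b2)
    return h, y

def mse(data, W1, b1, W2, b2):
    """Average of 0.5 * e^2 over the data set."""
    return sum(0.5 * (d - forward(x, W1, b1, W2, b2)[1]) ** 2
               for x, d in data) / len(data)

def train(data, n_hidden=4, mu=0.1, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Small random initial weights.
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = rng.uniform(-0.5, 0.5)
    initial = mse(data, W1, b1, W2, b2)
    for _ in range(epochs):
        for x, d in data:                       # online, sequential order
            h, y = forward(x, W1, b1, W2, b2)   # forward pass
            # Backward pass: delta terms, using d/ds tanh(s) = 1 - tanh(s)^2.
            delta_out = (d - y) * (1.0 - y * y)
            delta_hid = [delta_out * W2[j] * (1.0 - h[j] * h[j])
                         for j in range(n_hidden)]
            # Weight corrections; threshold weights see a constant input of 1.
            for j in range(n_hidden):
                W2[j] += mu * delta_out * h[j]
                b1[j] += mu * delta_hid[j]
                for i in range(n_in):
                    W1[j][i] += mu * delta_hid[j] * x[i]
            b2 += mu * delta_out
    return initial, mse(data, W1, b1, W2, b2)

# XOR with +/-1 coding: not realizable by a single linear threshold unit.
xor = [([-1, -1], -1), ([-1, 1], 1), ([1, -1], 1), ([1, 1], -1)]
initial_error, final_error = train(xor)
```

Each weight change uses only the delta at the unit the weight feeds and the signal on the connection, which is the locality property noted above. The hidden deltas are computed before any weight is changed so that the backward pass uses the same weights as the forward pass.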

BP Architecture

[Figure: forward network (output of computational values calculated) and sensitivity network (output of error terms calculated)]

Modifications to BP Algorithm

- Batch procedure
- Variable step size
- Better approximation of gradient method (momentum term, conjugate gradient)
- Newton methods (Hessian)
- Alternate cost functions
- Regularization
- Network construction algorithms
- Incorporating time

When to stop training

- First, major features are captured. As training continues, minor features are captured.
- Look at training error.
- Cross-validation (training, validation, and test sets)

testing error training error Lear