Anthony Kuh: Neural Networks and Learning Theory
Neural Networks and Learning Theory
Spring 2005
Prof. Anthony Kuh
POST 205E, Dept. of Electrical Engineering, University of Hawaii
Phone: (808)-956-7527, Fax: (808)-956-3427
Email: email@example.com
Preliminaries
Class meeting time: MWF 8:30-9:20
Office hours: MWF 10-11 (or by appointment)
Prerequisites:
- Probability: EE342 or equivalent
- Linear algebra
- Programming: Matlab or C experience
I. Introduction to Neural Networks
Goal: study the computational capabilities of neural networks and learning systems.
Multidisciplinary field: algorithms, analysis, applications.
A. Motivation
Why study neural networks and machine learning?
- Biological inspiration (natural computation)
- Nonparametric models: adaptive learning systems, learning from examples, analysis of learning models
- Implementation
- Applications
Cognitive (human vs. computer intelligence): humans are superior to computers at pattern recognition, associative recall, and learning complex tasks; computers are superior to humans at arithmetic computations and simple repeatable tasks.
Biological (study of the human brain): 10^10 to 10^11 neurons in the cerebral cortex, with an average of 10^3 interconnections per neuron.
Schematic of one neuron
Neural Network
Connecting many neurons together forms a neural network.
Neural network properties:
- Highly parallel (distributed computing)
- Robust and fault tolerant
- Flexible (short- and long-term learning)
- Handles a variety of information (often random, fuzzy, and inconsistent)
- Small, compact, dissipates very little power
B. Single Neuron (computational node)
s = w^T x + w0: synaptic strength (linearly weighted sum of inputs).
y = g(s): activation or squashing function.
Activation Functions
- Linear units: g(s) = s.
- Linear threshold units: g(s) = sgn(s).
- Sigmoidal units: g(s) = tanh(Bs), B > 0.
Neural networks generally have nonlinear activation functions.
Most popular models: linear threshold units and sigmoidal units. Other types of computational units: receptive units (radial basis functions).
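The three activation functions above, and the single-neuron computation s = w^T x + w0 with y = g(s), can be sketched as follows (a minimal illustrative sketch in Python/NumPy; the function names are our own):

```python
import numpy as np

def linear(s):
    # Linear unit: g(s) = s
    return s

def linear_threshold(s):
    # sgn(s): +1 for s >= 0, -1 otherwise
    return np.where(s >= 0, 1.0, -1.0)

def sigmoidal(s, B=1.0):
    # tanh(B*s), with steepness parameter B > 0
    return np.tanh(B * s)

def neuron(x, w, w0, g=sigmoidal):
    # Single computational node: s = w^T x + w0, y = g(s)
    return g(np.dot(w, x) + w0)
```

With a linear threshold activation this node is exactly the linear threshold unit studied in Section II.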
C. Neural Network Architectures
Systems composed of interconnected neurons.
A neural network is represented by a directed graph: edges represent weights and nodes represent computational units.
Definitions
- A feedforward neural network has no loops in its directed graph.
- Neural networks are often arranged in layers.
- A single layer feedforward neural network has one layer of computational nodes.
- A multilayer feedforward neural network has two or more layers of computational nodes.
- Computational nodes that are not output nodes are called hidden units.
D. Learning and Information Storage
1) Neural networks have computational capabilities. Where is information stored in a neural network? What are the parameters of a neural network?
2) How does a neural network work? (two phases)
- Training or learning phase (equivalent to the write phase in conventional computer memory): weights are adjusted to meet a certain desired criterion.
- Recall or test phase (equivalent to the read phase in conventional computer memory): weights are fixed as the neural network realizes some task.
Learning and Information (continued)
3) What can neural network models learn?
- Boolean functions
- Pattern recognition problems
- Function approximation
- Dynamical systems
4) What types of learning algorithms are there?
- Supervised learning (learning with a teacher)
- Unsupervised learning (no teacher)
- Reinforcement learning (learning with a critic)
Learning and Information (continued)
5) How do neural networks learn?
Iterative algorithm: the weights of the neural network are adjusted online as training data is received.
w(k+1) = L(w(k), x(k), d(k)) for supervised learning, where d(k) is the desired output.
A cost criterion is needed; a common one is the mean squared error (for one output): J(w) = E[(y(k) - d(k))^2].
The goal is to find the minimum of J(w) over all possible w. Iterative techniques often use gradient descent approaches.
Learning and Information (continued)
6) Learning and generalization
A learning algorithm takes training examples as inputs and produces the concept, pattern, or function to be learned.
How good is a learning algorithm? Generalization ability measures how well the learned function performs on new data.
- Sufficient number of training examples (law of large numbers, typical sequences).
- Occam's razor: the simplest explanation is the best.
[Figure: regression problem - fitting a function to training points]
Learning and Information (continued)
Generalization error: e_g = e_emp + e_model
- Empirical error e_emp: average error on the training data (desired output vs. actual output).
- Model error e_model: due to the dimensionality of the class of functions or patterns.
We want the class to be large enough that the empirical error is small, yet small enough that the model error is small.
II. Linear Threshold Units
A. Preliminaries
sgn(s) = +1 if s >= 0, and -1 if s < 0.

B. Perceptron Learning Algorithm (PLA)
1) Set k = 0; initialize w(0).
2) Compute s(k) = w(k)^T x(k) for training example (x(k), d(k)).
3) If x(k) is correctly classified, i.e. d(k) s(k) > 0, go to 5).
4) Update: w(k+1) = w(k) + x(k) d(k).
5) k = k+1; check whether we have cycled through the data with no updates. If not, go to 2); otherwise stop.
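The PLA steps above can be sketched as follows (a minimal sketch assuming labels d(k) in {-1, +1} and cycling through the data until no updates occur; the `max_epochs` cap is our own safety addition, not part of the original algorithm):

```python
import numpy as np

def pla(X, d, max_epochs=100):
    """Perceptron learning algorithm: cycle through the data, applying
    w(k+1) = w(k) + x(k) d(k) whenever x(k) is misclassified.
    X: (n_samples, n_features) array, d: labels in {-1, +1}.
    Returns w, or None if the data were not separated within max_epochs passes."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for x, dk in zip(X, d):
            if dk * np.dot(w, x) <= 0:   # misclassified (or on the boundary)
                w = w + x * dk           # PLA update
                updated = True
        if not updated:                  # cycled through data with no updates
            return w
    return None
```

For example, the AND function (with a constant bias input of 1 appended to each x) is linearly separable, so the algorithm terminates; by the convergence theorem it makes at most (max||x||/margin)^2 updates.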
PLA Comments
- Perceptron convergence theorem (requires a margin); sketch of proof follows.
- Threshold weights are updated in the same manner.
- The algorithm is based on the cost function J(w) = -(sum of synaptic strengths of the misclassified points), giving the gradient descent update w(k+1) = w(k) - mu(k) grad J(w(k)).
Perceptron Convergence Theorem
Assumptions: w* is a solution with ||w*|| = 1, no threshold, and w(0) = 0.
Let B = max ||x(k)|| and g = min d(k) x(k)^T w* (the margin).
Each update gives w(k)^T w* >= w(k-1)^T w* + g, so w(k)^T w* >= k g.
Also ||w(k)||^2 <= ||w(k-1)||^2 + ||x(k-1)||^2 <= ||w(k-1)||^2 + B^2, so ||w(k)||^2 <= k B^2.
Combining, k g <= w(k)^T w* <= ||w(k)|| <= sqrt(k) B, which implies k <= (B/g)^2 (max number of updates).
III. Linear Units
A. Preliminaries
Model Assumptions and Parameters
Training examples (x(k), d(k)) are drawn randomly.
Parameters:
- Inputs: x(k)
- Outputs: y(k)
- Desired outputs: d(k)
- Weights: w(k)
- Error: e(k) = d(k) - y(k)
Error Criterion (MSE)
min over w of J(w) = E[0.5 e(k)^2]
Wiener Solution
Define P = E[x(k) d(k)] and R = E[x(k) x(k)^T].
J(w) = 0.5 E[(d(k) - y(k))^2]
     = 0.5 E[d(k)^2] - E[x(k) d(k)]^T w + 0.5 w^T E[x(k) x(k)^T] w
     = 0.5 E[d(k)^2] - P^T w + 0.5 w^T R w
Note that J(w) is a quadratic function of w. To minimize J(w), compute the gradient and set it to zero:
grad J(w) = -P + R w = 0, so R w = P (Wiener solution).
If R is nonsingular, then w = R^-1 P. The resulting MSE is 0.5 E[d(k)^2] - P^T R^-1 P.
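A minimal numerical sketch of the Wiener solution: estimate R and P from randomly drawn samples of an assumed linear model d(k) = w_true^T x(k) + noise (our own illustrative setup), then solve R w = P.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate training pairs (x(k), d(k)) from a known linear model,
# then recover the weights from the Wiener solution.
w_true = np.array([2.0, -1.0])
X = rng.standard_normal((5000, 2))
d = X @ w_true + 0.1 * rng.standard_normal(5000)

# Sample estimates of R = E[x x^T] and P = E[x d]
R = (X.T @ X) / len(X)
P = (X.T @ d) / len(X)

w = np.linalg.solve(R, P)   # Wiener solution: R w = P
```

With enough samples the estimates of R and P concentrate around their expectations, so w lands close to w_true.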
Iterative Algorithms
Steepest descent algorithm (move in the direction of the negative gradient):
w(k+1) = w(k) - mu grad J(w(k)) = w(k) + mu (P - R w(k))
Least mean square (LMS) algorithm (approximate the gradient from one training example):
grad J(w(k)) is estimated by -e(k) x(k), giving
w(k+1) = w(k) + mu e(k) x(k)
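The LMS update can be sketched on synthetic data (a minimal sketch; the step size mu = 0.01 and the noise-free linear model are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

w_true = np.array([1.0, -2.0])   # unknown system to be identified
w = np.zeros(2)                  # w(0) = 0
mu = 0.01                        # step size (must satisfy 0 < mu < 2/lambda_max)

for k in range(20000):
    x = rng.standard_normal(2)   # input x(k)
    d = w_true @ x               # desired output d(k) (noise-free here)
    y = w @ x                    # actual output y(k)
    e = d - y                    # error e(k) = d(k) - y(k)
    w = w + mu * e * x           # LMS update: w(k+1) = w(k) + mu e(k) x(k)
```

Each update uses only the current example, which is what makes LMS simple to implement and able to track time-varying data, at the cost of relatively slow convergence.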
Steepest Descent Convergence
w(k+1) = w(k) + mu (P - R w(k)); let w* be the solution.
Center the weight vector: v = w - w*, so v(k+1) = v(k) - mu R v(k) = (I - mu R) v(k). Assume R is nonsingular.
Decorrelate the weight vector: u = Q^-1 v, where R = Q L Q^-1 is the transformation that diagonalizes R (L is the diagonal matrix of eigenvalues of R).
Then u(k+1) = (I - mu L) u(k), so u(k) = (I - mu L)^k u(0).
Condition for convergence: 0 < mu < 2/lambda_max.
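The step-size condition can be checked numerically (the matrix R and target w* below are our own illustrative choices): a step size just inside 2/lambda_max converges, while one just outside diverges.

```python
import numpy as np

# Steepest descent w(k+1) = w(k) + mu (P - R w(k)) on a small example.
R = np.array([[2.0, 0.5],
              [0.5, 1.0]])
w_star = np.array([1.0, 1.0])
P = R @ w_star                     # chosen so the Wiener solution is w_star

lam_max = np.max(np.linalg.eigvalsh(R))

def steepest_descent(mu, steps=500):
    w = np.zeros(2)                # w(0) = 0
    for _ in range(steps):
        w = w + mu * (P - R @ w)   # move along the negative gradient
    return w

w_good = steepest_descent(0.9 * 2 / lam_max)   # inside the stability bound
w_bad = steepest_descent(1.1 * 2 / lam_max)    # outside the bound: diverges
```

In the decorrelated coordinates each mode is scaled by (1 - mu * lambda_i) per step, so any |1 - mu * lambda_i| > 1 makes that mode grow without bound.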
LMS Algorithm Properties
- Steepest descent and LMS convergence depend on the step size and the eigenvalues of R.
- The LMS algorithm is simple to implement.
- LMS convergence is relatively slow; there is a tradeoff between convergence speed and excess MSE.
- The LMS algorithm can track training data that is time varying.
Adaptive MMSE Methods
With training data:
- Linear MMSE: LMS, RLS algorithms
- Nonlinear: decision feedback detectors
Blind algorithms
Second-order statistics:
- Minimum output energy methods
- Reduced-order approximations: PCA, multistage Wiener filter
Higher-order statistics:
- Cumulants, information-based criteria
Designing a Learning System
Given a set of training data, design a system that can realize the desired task.
IV. Multilayer Networks
A. Capabilities
Capabilities depend directly on the total number of weights and threshold values. A one-hidden-layer network with a sufficient number of hidden units can arbitrarily closely approximate any Boolean function, solve pattern recognition problems, and approximate well-behaved functions. Sigmoidal units are more powerful than linear threshold units.
B. Error Backpropagation
The error backpropagation algorithm is a methodical way of implementing the LMS algorithm for multilayer neural networks.
- Two passes: forward pass (computational pass) and backward pass (weight-correction pass).
- Analog computations based on the MSE criterion.
- Hidden units are usually sigmoidal units.
- Initialization: weights take on small random values.
- The algorithm may not converge to a global minimum.
- The algorithm converges more slowly than for linear networks.
- The representation is distributed.
BP Algorithm Comments
- Delta terms (the error terms) are computed from the output layer back to the first layer in the dual network.
- Training is usually done online; examples are presented in random or sequential order.
- The update rule is local: weight changes only involve connections adjacent to the weight.
- Computational complexity depends on the number of computational units.
- Initial weights are randomized to help avoid converging to poor local minima.
BP Algorithm Comments (continued)
- Threshold weights are updated in the same manner as other weights (with input = 1).
- A momentum term can be added to speed up convergence.
- The step size is set to a small value.
- Sigmoidal activation derivatives are simple to compute.
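A minimal one-hidden-layer backpropagation sketch on the XOR problem (our own toy setup: tanh hidden units, a linear output unit, the MSE criterion, online updates, and small random initial weights):

```python
import numpy as np

rng = np.random.default_rng(42)

# XOR with +/-1 coding
X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
d = np.array([-1., 1., 1., -1.])

n_hidden = 8
W1 = 0.5 * rng.standard_normal((n_hidden, 2))   # small random initial weights
b1 = 0.5 * rng.standard_normal(n_hidden)
w2 = 0.5 * rng.standard_normal(n_hidden)
b2 = 0.0
mu = 0.05                                       # small step size

def outputs():
    return np.array([w2 @ np.tanh(W1 @ x + b1) + b2 for x in X])

mse_before = np.mean((d - outputs())**2)

for epoch in range(5000):
    for x, t in zip(X, d):
        # Forward pass (computational pass)
        h = np.tanh(W1 @ x + b1)
        y = w2 @ h + b2
        # Backward pass (weight-correction pass): delta terms
        delta2 = t - y                          # output delta (linear unit)
        delta1 = (1 - h**2) * (w2 * delta2)     # hidden deltas via tanh derivative
        # Local gradient-descent updates
        w2 += mu * delta2 * h
        b2 += mu * delta2
        W1 += mu * np.outer(delta1, x)
        b1 += mu * delta1

mse_after = np.mean((d - outputs())**2)
```

Note how the tanh derivative 1 - h^2 is computed directly from the forward-pass outputs, and how each weight update involves only the quantities at its two endpoints.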
[Figure: forward pass - outputs of the computational units are calculated]
[Figure: backward pass - error terms are calculated]
Modifications to the BP Algorithm
- Batch procedure
- Variable step size
- Better approximation of the gradient (momentum term, conjugate gradient)
- Newton methods (Hessian)
- Alternate cost functions
- Regularization
- Network construction algorithms
- Incorporating time
When to Stop Training
- Major features are captured first; as training continues, minor features are captured.
- Look at the training error.
- Cross-validation (training, validation, and test sets).
[Figure: testing error and training error vs. training time]