Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support, Fall 2005Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo
6.873/HST.951 Medical Decision Support Spring 2005
Artificial Neural Networks
Lucila Ohno-Machado (with many slides borrowed from Stephan Dreiseitl. Courtesy of Stephan Dreiseitl. Used with permission.)
Overview
• Motivation
• Perceptrons
• Multilayer perceptrons
• Improving generalization
• Bayesian perspective
Motivation
Images removed due to copyright reasons.
benign lesion vs. malignant lesion
Motivation
• Human brain
– Parallel processing
– Distributed representation
– Fault tolerant
– Good generalization capability
• Mimic structure and processing in computational model
Biological Analogy
[Figure: a biological neuron with dendrites, axon, and synapses (excitatory + and inhibitory −); the synapses correspond to the network's weights and the neurons to its nodes.]
Perceptrons
[Figure: a perceptron with an input layer and an output layer, mapping the input patterns 00, 01, 10, 11 to sorted patterns.]
Activation functions
Perceptrons (linear machines)

[Figure: input units (Cough, Headache) connected by weights to output units (No disease, Pneumonia, Flu, Meningitis). error = what we wanted − what we got; the Δ rule changes the weights to decrease the error.]
Abdominal Pain Perceptron

[Figure: input units Male, Age, Temp, WBC, Pain Intensity, Pain Duration (example values 1, 37, 10, 1, 1, 20) connected by adjustable weights to output units Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Cholecystitis, Small Bowel Obstruction, Pancreatitis, Non-specific Pain (example outputs 0 1 0 0 0 0 0).]
AND

y = f(x1·w1 + x2·w2), with threshold θ = 0.5
f(a) = 1, for a > θ; 0, for a ≤ θ

input → output
0 0 → 0
0 1 → 0
1 0 → 0
1 1 → 1

f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 0
f(1·w1 + 0·w2) = 0
f(1·w1 + 1·w2) = 1

some possible values for w1 and w2:
w1    w2
0.20  0.35
0.20  0.40
0.25  0.30
0.40  0.20
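The truth table above can be checked directly. A minimal sketch in Python, assuming the slide's step activation f(a) = 1 for a > θ and 0 otherwise, with θ = 0.5:

```python
# Threshold perceptron from the AND slide.

def f(a, theta=0.5):
    """Step activation: fires only when input exceeds the threshold."""
    return 1 if a > theta else 0

def perceptron(x1, x2, w1, w2):
    return f(x1 * w1 + x2 * w2)

# One of the weight pairs from the slide's table implements AND:
w1, w2 = 0.20, 0.35
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", perceptron(x1, x2, w1, w2))
```

Any pair with w1 ≤ 0.5, w2 ≤ 0.5, and w1 + w2 > 0.5 works, which is why the table lists several solutions.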
Single layer neural network

Input to unit i: ai = measured value of variable i
Input to unit j: aj = Σi wij ai
Output of unit j: oj = 1 / (1 + e^−(aj + θj))

[Figure: input units i connected by weights wij to output units j.]
Single layer neural network

Output of unit j: oj = 1 / (1 + e^−(aj + θj))
Input to unit j: aj = Σi wij ai
Input to unit i: ai = measured value of variable i

Increasing θ

[Figure: sigmoid curves plotted for a from −15 to 15, for three values of θ; increasing θ shifts the curve to the left, so the same net input produces a larger output.]
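A small sketch of a single sigmoid unit, using the slide's formulas (aj = Σi wij·ai, oj = 1 / (1 + e^−(aj + θj))); the inputs and weights are made up for illustration:

```python
import math

def unit_output(inputs, weights, theta):
    """Output of one sigmoid unit: o = 1 / (1 + exp(-(a + theta)))."""
    a = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(a + theta)))

# Increasing theta raises the output for the same net input
# (the sigmoid curve shifts left along the a-axis):
x = [1.0, 2.0]
w = [0.5, -0.25]          # net input a = 0.5 - 0.5 = 0
for theta in (0.0, 1.0, 2.0):
    print(theta, unit_output(x, w, theta))
```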
Training: Minimize Error

[Figure: input units (Cough, Headache) connected by weights to output units (No disease, Pneumonia, Flu, Meningitis). error = what we wanted − what we got; the Δ rule changes the weights to decrease the error.]
Error Functions

• Mean Squared Error (for regression problems), where t is target, o is output:
E = Σ (t − o)² / n
• Cross Entropy Error (for binary classification), with oj = 1 / (1 + e^−(aj + θj)):
E = −Σ [t log o + (1 − t) log (1 − o)]
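Both error functions above can be written in a few lines; a sketch with made-up targets t and outputs o:

```python
import math

def mse(t, o):
    """Mean squared error: sum((t - o)^2) / n."""
    return sum((ti - oi) ** 2 for ti, oi in zip(t, o)) / len(t)

def cross_entropy(t, o):
    """Cross-entropy: -sum(t*log(o) + (1-t)*log(1-o))."""
    return -sum(ti * math.log(oi) + (1 - ti) * math.log(1 - oi)
                for ti, oi in zip(t, o))

print(mse([1.0, 0.0], [0.9, 0.2]))            # small for good predictions
print(cross_entropy([1, 0], [0.9, 0.2]))
```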
Error function

• Convention: w := (w0, w), x := (1, x); w0 is the "bias"
• o = f(w · x)
• Class labels ti ∈ {+1, −1}
• Error measure: E = −Σ_{i miscl.} ti (w · xi)
• How to minimize E?

[Figure: perceptron y = f(x1·w1 + x2·w2) with θ = 0.5.]
Minimizing the Error

[Figure: error surface over weight w, with the initial error at the initial w and the final error at the trained w; the derivative guides gradient descent, which can end in a local minimum rather than the global minimum.]
Perceptron learning

• Find minimum of E by iterating w_{k+1} = w_k − η grad_w E
• E = −Σ_{i miscl.} ti (w · xi) ⇒ grad_w E = −Σ_{i miscl.} ti xi
• "online" version: pick a misclassified xi and update w_{k+1} = w_k + η ti xi
Perceptron learning
• Update rule: w_{k+1} = w_k + η ti xi
• Theorem: perceptron learning converges for linearly separable sets
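The online rule above can be sketched in Python, using the slide's conventions x := (1, x) (so w0 is the bias) and class labels ti ∈ {+1, −1}. The data below are made up but linearly separable, so by the theorem the loop converges:

```python
def predict(w, x):
    """Sign of w . x, reported as a class label in {+1, -1}."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train(data, eta=1.0, max_epochs=100):
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        errors = 0
        for x, t in data:
            if predict(w, x) != t:          # pick a misclassified x_i
                w = [wi + eta * t * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                      # all examples correct: converged
            return w
    return w

# x = (1, x1, x2); this set is linearly separable
data = [((1, 0, 0), -1), ((1, 0, 1), -1), ((1, 1, 0), -1), ((1, 1, 1), 1)]
w = train(data)
print(w, [predict(w, x) for x, _ in data])
```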
Gradient descent

• Simple function minimization algorithm
• Gradient is vector of partial derivatives
• Negative gradient is direction of steepest descent

[Figure: a 3-D error surface and its contour plot, with gradient-descent steps moving toward the minimum. Figures by MIT OCW.]
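A minimal gradient-descent sketch on the made-up one-dimensional function E(w) = (w − 3)², whose gradient is 2(w − 3); the iterate w_{k+1} = w_k − η · grad E approaches the minimum at w = 3:

```python
def grad(w):
    """Gradient of E(w) = (w - 3)^2."""
    return 2 * (w - 3)

w, eta = 0.0, 0.1
for _ in range(100):
    w = w - eta * grad(w)   # step in the direction of steepest descent
print(w)                    # close to the minimum at w = 3
```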
Classification Model

[Figure: inputs Age = 34, Gender = 1, Stage = 4 (independent variables x1, x2, x3) are multiplied by coefficients a, b, c (weights, e.g. .5, .8, .4), summed (Σ), and produce the output 0.6, the "probability of being alive" (dependent variable p). Prediction.]
Terminology

• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = "weights"
• Estimates = "targets"
• Iterative step = cycle, epoch
XOR

y = f(x1·w1 + x2·w2), with threshold θ = 0.5
f(a) = 1, for a > θ; 0, for a ≤ θ

input → output
0 0 → 0
0 1 → 1
1 0 → 1
1 1 → 0

f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0

some possible values for w1 and w2:
(none — no pair of weights satisfies all four constraints: rows 2 and 3 require w1 > θ and w2 > θ, but row 4 requires w1 + w2 ≤ θ)
XOR

[Figure: x1 and x2 feed a hidden unit z (weights w1, w2; θ = 0.5) and also feed the output y directly (weights w3, w4); z feeds y with weight w5 (θ = 0.5).]

input → output
0 0 → 0
0 1 → 1
1 0 → 1
1 1 → 0

y = f(w1, w2, w3, w4, w5), with f(a) = 1, for a > θ; 0, for a ≤ θ

a possible set of values for the ws:
(w1, w2, w3, w4, w5) = (0.3, 0.3, 1, 1, −2)
XOR

[Figure: network with weights w1 … w6 and θ = 0.5 for all units.]

input → output
0 0 → 0
0 1 → 1
1 0 → 1
1 1 → 0

y = f(w1, w2, w3, w4, w5, w6), with f(a) = 1, for a > θ; 0, for a ≤ θ

a possible set of values for the ws:
(w1, w2, w3, w4, w5, w6) = (0.6, −0.6, −0.7, 0.8, 1, 1)
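The first hidden-unit solution can be verified directly. With (w1, w2, w3, w4, w5) = (0.3, 0.3, 1, 1, −2) and θ = 0.5 everywhere, the hidden unit z fires only for input 1 1, and the output then reproduces XOR:

```python
def f(a, theta=0.5):
    """Step activation from the slides."""
    return 1 if a > theta else 0

def xor_net(x1, x2):
    z = f(0.3 * x1 + 0.3 * x2)        # fires only when both inputs are 1
    return f(1 * x1 + 1 * x2 - 2 * z) # direct paths minus hidden inhibition

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))
```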
From perceptrons to multilayer perceptrons
Why?
Abdominal Pain

[Figure: input units Male, Age, Temp, WBC, Pain Intensity, Pain Duration (example values 1, 37, 10, 1, 1, 20) connected by adjustable weights to output units Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Cholecystitis, Small Bowel Obstruction, Pancreatitis, Non-specific Pain (example outputs 0 1 0 0 0 0 0).]
Heart Attack Network

[Figure: input units Duration of Pain, Intensity of Pain, ECG: ST elevation, Smoker, Age, Male (example values shown) feeding a network whose output unit Myocardial Infarction gives the "probability" of MI, here 0.8.]
Multilayered Perceptrons

Input to unit i: ai = measured value of variable i
Input to unit j: aj = Σi wij ai
Output of unit j: oj = 1 / (1 + e^−(aj + θj))
Input to unit k: ak = Σj wjk oj
Output of unit k: ok = 1 / (1 + e^−(ak + θk))

[Figure: input units i, hidden units j, output units k; the perceptron has no hidden layer, the multilayered perceptron does.]
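The forward pass above can be sketched directly from the formulas. All weights and inputs below are made up; the structure (two inputs, two sigmoid hidden units j, one sigmoid output unit k) follows the slide:

```python
import math

def sigmoid(a, theta=0.0):
    return 1.0 / (1.0 + math.exp(-(a + theta)))

def forward(x, w_ij, w_jk, theta_j, theta_k):
    # hidden layer: a_j = sum_i w_ij * a_i, o_j = sigmoid(a_j + theta_j)
    o_hidden = [sigmoid(sum(w[i] * x[i] for i in range(len(x))), th)
                for w, th in zip(w_ij, theta_j)]
    # output layer: a_k = sum_j w_jk * o_j, o_k = sigmoid(a_k + theta_k)
    a_k = sum(wk * oj for wk, oj in zip(w_jk, o_hidden))
    return sigmoid(a_k, theta_k)

x = [1.0, 0.5]                      # measured values of the variables
w_ij = [[0.6, -0.4], [0.2, 0.8]]    # hypothetical hidden-layer weights
w_jk = [1.0, -1.5]                  # hypothetical output-layer weights
print(forward(x, w_ij, w_jk, theta_j=[0.0, 0.0], theta_k=0.0))
```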
Neural Network Model

[Figure: inputs Age = 34, Gender = 2, Stage = 4 (independent variables) feed a hidden layer of three summing units through weights (e.g. .6, .5, .8, .2, .1, .3, .7, .2), whose outputs are combined through weights .4 and .2 into the output 0.6, the "probability of being alive" (dependent variable). Prediction.]
"Combined logistic models"

[Figure: one hidden unit of the previous network viewed as a logistic model: inputs Age = 34, Gender = 2, Stage = 4 feed a single summing unit (weights shown: .6, .5, .8, .1, .7), contributing to the output 0.6, the "probability of being alive". Prediction.]
[Two more slides highlight the remaining hidden units in the same way: each combines Age, Gender, and Stage through its own weights (Σ) as a separate logistic model contributing to the output 0.6, the "probability of being alive". Prediction.]
Not really, no target for hidden units...

[Figure: the full network again (inputs Age = 34, Gender = 2, Stage = 4; hidden layer of summing units; output 0.6, the "probability of being alive"). Prediction.]
Hidden Units and Backpropagation

[Figure: error = what we wanted − what we got, computed at the output units; the Δ rule updates the output-layer weights, and the error is backpropagated through the hidden units so the Δ rule can also update the weights from the input units.]
Multilayer perceptrons

• Sigmoidal hidden layer
• Can represent arbitrary decision regions
• Can be trained similarly to perceptrons
ECG Interpretation

[Figure: network mapping ECG features (R-R interval, P-R interval, S-T elevation, QRS duration, QRS amplitude, AVF lead) to diagnoses (SV tachycardia, Ventricular tachycardia, LV hypertrophy, RV hypertrophy, Myocardial infarction).]
Linear Separation

Separate n-dimensional space using one (n − 1)-dimensional hyperplane.

[Figure: two examples. In two dimensions, a line separates the diagnoses Meningitis and Flu from No disease and Pneumonia according to Cough and Headache. In three dimensions, a plane separates the binary input patterns 000, 010, 101, 111, 110, 011, 100 into Treatment and No treatment regions.]
Another way of thinking about this…

• Have data set D = {(xi, ti)} drawn from probability distribution P(x, t)
• Model P(x, t) given samples D by ANN with adjustable parameter w
• Statistics analogy: maximum likelihood estimation
Maximum Likelihood Estimation
• Maximize likelihood of data D
• Likelihood L = Πi p(xi, ti) = Πi p(ti|xi) p(xi)
• Minimize −log L = −Σi log p(ti|xi) − Σi log p(xi)
• Drop the second term: it does not depend on w
• Two cases: "regression" and classification
Likelihood for classification (i.e. categorical target)

• For classification, targets ti are class labels
• Minimize −Σi log p(ti|xi)
• p(ti|xi) = y(xi, w)^ti (1 − y(xi, w))^(1−ti) ⇒
−log p(ti|xi) = −ti log y(xi, w) − (1 − ti) log(1 − y(xi, w))
• Minimizing −log L is equivalent to minimizing
−Σi [ti log y(xi, w) + (1 − ti) log(1 − y(xi, w))] (cross-entropy error)
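The equivalence can be checked numerically: for Bernoulli p(t|x) = y^t (1 − y)^(1−t), the negative log-likelihood of the data equals the cross-entropy error. Targets t and network outputs y below are made up:

```python
import math

t = [1, 0, 1, 1]            # toy class labels
y = [0.9, 0.2, 0.7, 0.6]    # toy network outputs

# Negative log-likelihood under the Bernoulli model:
neg_log_L = -sum(math.log(yi ** ti * (1 - yi) ** (1 - ti))
                 for ti, yi in zip(t, y))

# Cross-entropy error:
cross_entropy = -sum(ti * math.log(yi) + (1 - ti) * math.log(1 - yi)
                     for ti, yi in zip(t, y))

print(neg_log_L, cross_entropy)   # identical up to rounding
```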
Likelihood for "regression" (i.e. continuous target)

• For regression, targets ti are real values
• Minimize −Σi log p(ti|xi)
• p(ti|xi) = 1/Z exp(−(y(xi, w) − ti)² / (2σ²)) ⇒
−log p(ti|xi) = (y(xi, w) − ti)² / (2σ²) + log Z
• y(xi, w) is the network output
• Minimizing −log L is equivalent to minimizing Σi (y(xi, w) − ti)² (sum-of-squares error)
Backpropagation algorithm

• Minimize the error function by gradient descent: w_{k+1} = w_k − η grad_w E
• Iterative gradient calculation by propagating error signals
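A sketch of the gradient calculation that backpropagation automates, for the smallest possible case: a single sigmoid unit with sum-of-squares error on one made-up example. The analytic gradient dE/dwi = (y − t) · y(1 − y) · xi is checked against a finite-difference estimate:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def error(w, x, t):
    """Sum-of-squares error E = 0.5 * (y - t)^2 for one example."""
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 0.5 * (y - t) ** 2

def analytic_grad(w, x, t):
    """Chain rule: dE/dw_i = (y - t) * y * (1 - y) * x_i."""
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(y - t) * y * (1 - y) * xi for xi in x]

w, x, t = [0.3, -0.8], [1.0, 2.0], 1.0
eps = 1e-6
numeric = [(error([wi + eps if j == i else wi for j, wi in enumerate(w)], x, t)
            - error(w, x, t)) / eps for i in range(len(w))]
print(analytic_grad(w, x, t))
print(numeric)   # agrees to several decimal places
```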
Backpropagation algorithm

Problem: how to set the learning rate η?

Better: use more advanced minimization algorithms (second-order information)

[Figure: contour plots of an error surface comparing minimization trajectories. Figures by MIT OCW.]
Backpropagation algorithm

Classification: cross-entropy error
Regression: sum-of-squares error
Overfitting

[Figure: real distribution vs. overfitted model.]
Improving generalization

Problem: memorizing (x, t) combinations ("overtraining")

x1    x2     t
0.7   0.5    0
0.9  −0.5    1
1    −1.2   −0.2
0.3   0.6    1
0.5  −0.2    ?
Improving generalization

• Need test set to judge performance
• Goal: represent information in data set, not noise
• How to improve generalization?
– Limit network topology
– Early stopping
– Weight decay
Limit network topology
• Idea: fewer weights ⇒ less flexibility
• Analogy to polynomial interpolation:
Limit network topology
Early Stopping

[Figure: training error keeps decreasing with training cycles while hold-out error reaches a minimum and then rises; stopping there avoids the overfitted model and approximates the "real" model.]
Early stopping

• Idea: stop training when information (but not noise) is modeled
• Need hold-out set to determine when to stop training
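The stopping rule above can be sketched with made-up error curves: training error keeps falling, hold-out error turns back up, and we stop at the hold-out minimum rather than the training minimum:

```python
# Hypothetical per-epoch errors (illustrative numbers only):
train_err   = [1.0, 0.7, 0.5, 0.35, 0.25, 0.18, 0.12, 0.08]
holdout_err = [1.1, 0.8, 0.6, 0.50, 0.45, 0.48, 0.55, 0.65]

# Stop at the epoch where the hold-out error is smallest:
best_epoch = min(range(len(holdout_err)), key=holdout_err.__getitem__)
print("stop after epoch", best_epoch)
```

Training error is still improving at that point; continuing past it only models noise.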
Overfitting

[Figure: total sum of squares (tss) vs. epochs for b = training set and a = hold-out set; tss b keeps decreasing while tss a reaches a minimum and then rises (overfitted model). Stopping criterion: min(Δtss) on the hold-out set.]
Early stopping
Weight decay
• Idea: control smoothness of network output by controlling size of weights
• Add term α‖w‖² to error function
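Adding α‖w‖² to the error adds 2αw to its gradient, so every gradient step also shrinks the weights; a sketch with made-up values for η and α:

```python
def decayed_step(w, grad_E, eta=0.1, alpha=0.01):
    """One gradient step on E(w) + alpha * ||w||^2."""
    return [wi - eta * (gi + 2 * alpha * wi) for wi, gi in zip(w, grad_E)]

# With zero data gradient, the penalty term alone pulls weights toward 0:
w = [5.0, -3.0]
for _ in range(50):
    w = decayed_step(w, grad_E=[0.0, 0.0])
print(w)   # smaller in magnitude than the starting weights
```

Smaller weights keep the sigmoid units in their near-linear range, which is what makes the network output smoother.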
Weight decay
Bayesian perspective

• Error function minimization corresponds to maximum likelihood (ML) estimate: single best solution wML
• Can lead to overtraining
• Bayesian approach: consider weight posterior distribution p(w|D)
Bayesian perspective

• Posterior = likelihood × prior
• p(w|D) = p(D|w) p(w) / p(D)
• Two approaches to approximating p(w|D):
– Sampling
– Gaussian approximation
Sampling from p(w|D)

[Figure: prior and likelihood distributions over w.]

Sampling from p(w|D)

[Figure: prior × likelihood = posterior.]
Bayesian example for regression
Bayesian example for classification
Model Features (with strong personal biases)

Model                       Modeling Effort   Examples Needed   Explanation
Rule-based Expert Systems   high              low               high?
Classification Trees        low               high+             "high"
Neural Nets, SVM            low               high              low
Regression Models           high              moderate          moderate
Learned Bayesian Nets       low               high+             high (beautiful when it works)
Regression vs. Neural Networks

[Figure: Y predicted from X1, X2, X3 and the interaction terms X1X2, X1X3, X2X3, X1X2X3; (2³ − 1) possible combinations. In a neural network, hidden units ("X1", "X2", "X1X2X3", …) can play the role of these terms.]

Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...
Summary

• ANNs inspired by functionality of brain
• Nonlinear data model
• Trained by minimizing error function
• Goal is to generalize well
• Avoid overtraining
• Distinguish ML and MAP solutions
Some References

Introductory and Historical Textbooks
• Rumelhart DE, McClelland JL (eds). Parallel Distributed Processing. MIT Press, Cambridge, 1986.
• Hertz JA, Krogh AS, Palmer RG. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
• Pao YH. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, 1989.
• Bishop CM. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.