Artificial Neural Networks 2
Morten Nielsen, BioSys, DTU
Outline
• Optimization procedures
– Gradient descent (this you already know)
• Network training
– back-propagation
– cross-validation
– over-fitting
– examples
Neural network. Error estimate
(diagram: inputs $I_1$, $I_2$ connected by weights $w_1$, $w_2$ to a linear output unit $o$)

$$o = I_1 \cdot w_1 + I_2 \cdot w_2$$

$$E = \frac{1}{2} \cdot (o - t)^2$$
Neural networks
Gradient descent (from Wikipedia)

Gradient descent is based on the observation that if the real-valued function F(x) is defined and differentiable in a neighborhood of a point a, then F(x) decreases fastest if one goes from a in the direction of the negative gradient of F at a. It follows that, if

$$b = a - \varepsilon \cdot \nabla F(a)$$

for $\varepsilon > 0$ a small enough number, then $F(b) < F(a)$.
Gradient descent (example)

$$F(x) = x^2$$
$$a = 2,\qquad F(a) = 4$$
Gradient descent (example)

$$F(x) = x^2$$
$$a = 2,\qquad F(a) = 4$$
$$\frac{\partial F}{\partial x} = 2 \cdot x = 4$$
$$b = a - \varepsilon \cdot \nabla F(a) = 2 - 0.1 \cdot 4 = 1.6$$
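This update rule is easy to try in code. A minimal sketch (my addition, not from the slides) that iterates the step above on $F(x) = x^2$ with the same $\varepsilon = 0.1$:

```python
def grad_F(x):
    """Gradient of F(x) = x**2."""
    return 2.0 * x

a = 2.0    # starting point, as in the example
eps = 0.1  # step size (epsilon)

for step in range(10):
    a = a - eps * grad_F(a)  # b = a - eps * grad F(a)
    print(f"step {step + 1}: x = {a:.4f}, F(x) = {a * a:.4f}")
# The first step gives x = 1.6, as on the slide, and F(x) keeps decreasing.
```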
Gradient descent. Example
Weights are changed in the opposite direction of the gradient of the error
$$w_i' = w_i + \Delta w_i$$

$$E = \frac{1}{2} \cdot (O - t)^2,\qquad O = \sum_i w_i \cdot I_i$$

$$\Delta w_i = -\varepsilon \cdot \frac{\partial E}{\partial w_i} = -\varepsilon \cdot \frac{\partial E}{\partial O} \cdot \frac{\partial O}{\partial w_i} = -\varepsilon \cdot (O - t) \cdot I_i$$

(diagram: inputs $I_1$, $I_2$ connected by weights $w_1$, $w_2$ to the linear output $O = I_1 \cdot w_1 + I_2 \cdot w_2$)
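As a sketch of this delta rule for the two-input linear unit (the input, weight, and target values below are made up for illustration):

```python
import numpy as np

eps = 0.1                  # learning rate (epsilon), assumed value
I = np.array([1.0, 2.0])   # example inputs I1, I2 (assumed)
w = np.array([0.5, -0.3])  # initial weights w1, w2 (assumed)
t = 1.0                    # target value (assumed)

O = np.dot(w, I)           # O = sum_i w_i * I_i  (linear output)
E = 0.5 * (O - t) ** 2     # E = 1/2 * (O - t)^2
dw = -eps * (O - t) * I    # Delta w_i = -eps * (O - t) * I_i
w = w + dw                 # w_i' = w_i + Delta w_i
print(f"O = {O:.3f}, E = {E:.3f}, updated w = {w}")
```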
What about the hidden layer?
$$\Delta w_i = -\varepsilon \cdot \frac{\partial E}{\partial w_i}$$

$$o_i = \sum_j w_{ij} \cdot H_j,\qquad h_j = \sum_k v_{jk} \cdot I_k$$

$$E = \frac{1}{2} \cdot (O - t)^2$$

$$O = g(o),\quad H = g(h),\qquad g(x) = \frac{1}{1 + e^{-x}}$$
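Before deriving the gradients, here is a minimal forward pass for this architecture (a sketch; for concreteness it reuses the weight values from the exercise slide further below):

```python
import numpy as np

def g(x):
    """Sigmoid activation g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

I = np.array([1.0, 1.0])     # inputs I_k
v = np.array([[1.0, 1.0],    # v[j][k]: input-to-hidden weights
              [-1.0, 1.0]])
w = np.array([[-1.0, 1.0]])  # w[i][j]: hidden-to-output weights

h = v @ I                    # h_j = sum_k v_jk * I_k
H = g(h)                     # H_j = g(h_j)
o = w @ H                    # o_i = sum_j w_ij * H_j
O = g(o)                     # O_i = g(o_i)
print(f"H = {H}, O = {O}")
```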
Hidden to output layer
Since $E = E(O(o(w_{ij})))$, the chain rule gives

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial O_i} \cdot \frac{\partial O_i}{\partial o_i} \cdot \frac{\partial o_i}{\partial w_{ij}}$$
Hidden to output layer
$$O = g(o),\qquad g(x) = \frac{1}{1 + e^{-x}}$$

$$g'(x) = \frac{-1}{(1 + e^{-x})^2} \cdot (-e^{-x}) = (1 - g(x)) \cdot g(x)$$

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial O_i} \cdot \frac{\partial O_i}{\partial o_i} \cdot \frac{\partial o_i}{\partial w_{ij}} = (O_i - t) \cdot g'(o_i) \cdot H_j$$

where the three factors are

$$\frac{\partial E}{\partial O_i} = (O_i - t),\qquad \frac{\partial O_i}{\partial o_i} = \frac{\partial g}{\partial o_i},\qquad \frac{\partial o_i}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum_l w_{il} \cdot H_l = H_j$$
Hidden to output layer
$$O_i = g(o_i),\qquad g'(o_i) = (1 - g(o_i)) \cdot g(o_i) = (1 - O_i) \cdot O_i$$

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial O_i} \cdot \frac{\partial O_i}{\partial o_i} \cdot \frac{\partial o_i}{\partial w_{ij}} = (O_i - t_i) \cdot g'(o_i) \cdot H_j = (O_i - t_i) \cdot (1 - O_i) \cdot O_i \cdot H_j$$
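A quick numerical sanity check of the identity $g'(x) = (1 - g(x)) \cdot g(x)$ (my addition):

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.7  # arbitrary test point
d = 1e-6
numeric = (g(x + d) - g(x - d)) / (2 * d)  # central-difference derivative
analytic = (1.0 - g(x)) * g(x)             # the identity from the slide
print(f"numeric = {numeric:.8f}, analytic = {analytic:.8f}")
# Both print ~0.2217, confirming the identity.
```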
Input to hidden layer
$$\frac{\partial E}{\partial v_{jk}} = \frac{\partial E}{\partial H_j} \cdot \frac{\partial H_j}{\partial v_{jk}}$$
Input to hidden layer

$$\frac{\partial E}{\partial v_{jk}} = \frac{\partial E}{\partial H_j} \cdot \frac{\partial H_j}{\partial v_{jk}}$$

$$\frac{\partial E}{\partial H_j} = \sum_i \frac{\partial E}{\partial o_i} \cdot \frac{\partial o_i}{\partial H_j} = \sum_i \frac{\partial E}{\partial o_i} \cdot w_{ij} = \sum_i \frac{\partial E}{\partial O_i} \cdot \frac{\partial O_i}{\partial o_i} \cdot w_{ij}$$

$$\frac{\partial H_j}{\partial v_{jk}} = \frac{\partial H_j}{\partial h_j} \cdot \frac{\partial h_j}{\partial v_{jk}} = g'(h_j) \cdot I_k$$

$$\frac{\partial E}{\partial v_{jk}} = g'(h_j) \cdot I_k \cdot \sum_i (O_i - t_i) \cdot g'(o_i) \cdot w_{ij}$$
Summary
$$\frac{\partial E}{\partial w_{ij}} = (O_i - t_i) \cdot g'(o_i) \cdot H_j$$

$$\frac{\partial E}{\partial v_{jk}} = g'(h_j) \cdot I_k \cdot \sum_i (O_i - t_i) \cdot g'(o_i) \cdot w_{ij}$$
Or
$$\frac{\partial E}{\partial w_{ij}} = (O_i - t_i) \cdot g'(o_i) \cdot H_j = \delta_i \cdot H_j$$

$$\frac{\partial E}{\partial v_{jk}} = g'(h_j) \cdot I_k \cdot \sum_i (O_i - t_i) \cdot g'(o_i) \cdot w_{ij} = g'(h_j) \cdot I_k \cdot \sum_i \delta_i \cdot w_{ij}$$

$$\delta_i = (O_i - t_i) \cdot g'(o_i)$$
Or
$$\frac{\partial E}{\partial w_{ij}} = \delta_i \cdot H_j = \delta_i \cdot x[1][j]$$

$$\frac{\partial E}{\partial v_{jk}} = g'(h_j) \cdot I_k \cdot \sum_i \delta_i \cdot w_{ij} = x[1][j] \cdot (1 - x[1][j]) \cdot I_k \cdot \sum_i \delta_i \cdot w_{ij}$$

$$\delta_i = (O_i - t_i) \cdot g'(o_i) = (x[2][i] - t_i) \cdot x[2][i] \cdot (1 - x[2][i])$$

with the layer activations stored as $I_k = x[0][k]$, $H_j = x[1][j]$, $O_i = x[2][i]$.
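This array form maps directly onto code. A compact sketch of one forward and backward pass in that notation (an illustration under assumed shapes, not the author's implementation; the weights are those of the exercise on the next slide):

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

eps = 0.5                  # learning rate (epsilon)
t = np.array([0.0])        # target value(s)

x0 = np.array([1.0, 1.0])  # x[0][k]: inputs I_k
v = np.array([[1.0, 1.0],  # v[j][k]: input-to-hidden weights
              [-1.0, 1.0]])
w = np.array([[-1.0, 1.0]])  # w[i][j]: hidden-to-output weights

# Forward pass: x[l] holds the activations of layer l.
x = [x0, None, None]
x[1] = g(v @ x[0])         # H_j = x[1][j]
x[2] = g(w @ x[1])         # O_i = x[2][i]

# Backward pass, exactly as in the formulas above.
delta = (x[2] - t) * x[2] * (1.0 - x[2])  # delta_i
dE_dw = np.outer(delta, x[1])             # dE/dw_ij = delta_i * x[1][j]
dE_dv = np.outer(x[1] * (1.0 - x[1]) * (delta @ w), x[0])
# dE/dv_jk = x[1][j] * (1 - x[1][j]) * (sum_i delta_i * w_ij) * I_k

w -= eps * dE_dw                          # gradient-descent updates
v -= eps * dE_dv
```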
Can you do it yourself?

Network (inputs $I_1 = I_2 = 1$):
– input-to-hidden weights: $v_{11} = 1$, $v_{12} = 1$, $v_{21} = -1$, $v_{22} = 1$
– hidden-to-output weights: $w_1 = -1$, $w_2 = 1$
– hidden units $h_1, h_2$ with $H = g(h)$; output unit $o$ with $O = g(o)$

$$\Delta w_{ij} = -\varepsilon \cdot \frac{\partial E}{\partial w_{ij}};\qquad \Delta v_{jk} = -\varepsilon \cdot \frac{\partial E}{\partial v_{jk}}$$

$$\frac{\partial E}{\partial w_{ij}} = (O_i - t_i) \cdot g'(o_i) \cdot H_j$$

$$\frac{\partial E}{\partial v_{jk}} = g'(h_j) \cdot I_k \cdot \sum_i (O_i - t_i) \cdot g'(o_i) \cdot w_{ij}$$

$$g'(x) = (1 - g(x)) \cdot g(x),\qquad O = g(o)$$

What is the output (O) from the network? What are the $\Delta w_{ij}$ and $\Delta v_{jk}$ values if the target value is 0 and $\varepsilon = 0.5$?
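If you want to check your pen-and-paper result, this sketch (my addition, using only the formulas and values above) runs the numbers:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

I = np.array([1.0, 1.0])    # I1 = I2 = 1
v = np.array([[1.0, 1.0],   # v11, v12
              [-1.0, 1.0]]) # v21, v22
w = np.array([-1.0, 1.0])   # w1, w2
t, eps = 0.0, 0.5           # target 0, epsilon = 0.5

h = v @ I                   # hidden inputs h1, h2
H = g(h)
o = w @ H                   # input to the output unit
O = g(o)

delta = (O - t) * O * (1.0 - O)  # output-layer delta
dw = -eps * delta * H            # Delta w_j
dv = -eps * np.outer(H * (1.0 - H) * (delta * w), I)  # Delta v_jk
print(f"O = {O:.3f}")
print(f"Delta w = {dw}")
print(f"Delta v = {dv}")
```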
Can you do it yourself (ε = 0.5)? Has the error decreased?

Before ($I_1 = I_2 = 1$; $v_{11} = 1$, $v_{12} = 1$, $v_{21} = -1$, $v_{22} = 1$; $w_1 = -1$, $w_2 = 1$):
$h_1 = ?$, $H_1 = ?$; $h_2 = ?$, $H_2 = ?$; $o = ?$, $O = ?$

Weight updates:
$\Delta w_1 = ?$, $\Delta w_2 = ?$
$\Delta v_{11} = ?$, $\Delta v_{12} = ?$, $\Delta v_{21} = ?$, $\Delta v_{22} = ?$

After (updated weights):
$v_{11} = ?$, $v_{12} = ?$, $v_{21} = ?$, $v_{22} = ?$; $w_1 = ?$, $w_2 = ?$
$h_1 = ?$, $H_1 = ?$; $h_2 = ?$, $H_2 = ?$; $o = ?$, $O = ?$
Sequence encoding
$$\frac{\partial E}{\partial v_{jk}} = g'(h_j) \cdot I_k \cdot \sum_i \delta_i \cdot w_{ij} = x[1][j] \cdot (1 - x[1][j]) \cdot I_k \cdot \sum_i \delta_i \cdot w_{ij}$$

• The change in a weight depends linearly on the input value
• "True" sparse (0/1) encoding is therefore highly inefficient: whenever $I_k = 0$, the weight $v_{jk}$ receives no update
• Sparse input is therefore most often encoded as +1/-1 or 0.9/0.05
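A small sketch of the two encodings for a single amino acid (the 20-letter alphabet is standard; the 0.9/0.05 values are from the slide):

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def sparse_01(aa):
    """'True' sparse encoding: 1 at the amino acid's position, 0 elsewhere.
    The zeros give no weight update at all for those inputs."""
    return [1.0 if a == aa else 0.0 for a in ALPHABET]

def sparse_soft(aa):
    """Softened sparse encoding (0.9/0.05): every input stays nonzero,
    so every weight keeps receiving a gradient signal."""
    return [0.9 if a == aa else 0.05 for a in ALPHABET]

print(sparse_01("C"))    # [0.0, 1.0, 0.0, ..., 0.0]
print(sparse_soft("C"))  # [0.05, 0.9, 0.05, ..., 0.05]
```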
Training and error reduction

Size matters
Neural network training

• A network contains a very large set of parameters
– A network with 5 hidden neurons predicting binding for 9-mer peptides has more than 9x20x5 = 900 weights
• Over-fitting is a problem
• Stop training when test performance is optimal

(figure: temperature as a function of years)
What is going on?
(figure: temperature as a function of years)
Examples
Train on 500 A0201 and 60 A0101 binding data. Evaluate on 1266 A0201 peptides.

NH = 1: PCC = 0.77
NH = 5: PCC = 0.72

(NH: number of hidden neurons; PCC: Pearson correlation coefficient)
Neural network training. Cross validation
(figure: the data split into 5 partitions of 20% each)

Cross validation

Train on 4/5 of the data, test on 1/5 => produces 5 different neural networks, each with a different prediction focus.
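A sketch of this partitioning (fold assignment by shuffled indices is my assumption):

```python
import random

def five_fold_splits(n_items, n_folds=5, seed=1):
    """Shuffle the example indices and cut them into n_folds ~20% partitions."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]                            # 1/5 of the data for testing
        train = [i for g, fold in enumerate(folds)
                 if g != f for i in fold]          # the remaining 4/5 for training
        yield train, test                          # one network is trained per fold

for k, (train, test) in enumerate(five_fold_splits(560)):
    print(f"fold {k}: {len(train)} training, {len(test)} test peptides")
```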
Neural network training curve
Maximum test set performance = most capable of generalizing
5 fold training
(figure: Pearson correlation, ranging from about 0.7 to 0.95, of the Train and Test sets for the five networks syn.00-syn.04)
Which network to choose?
5 fold training
(figure: Pearson correlation of the Train, Test, and Eval sets for the five networks syn.00-syn.04 and for the ensemble "ens")
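The "ens" column is the ensemble of the five networks; a common choice (assumed here, the slide does not spell it out) is simply the mean of their outputs:

```python
import numpy as np

def ensemble_predict(networks, peptides):
    """Average the predictions of the cross-validation networks.

    `networks` is a list of trained models with a .predict() method
    returning one score per peptide (hypothetical interface)."""
    scores = np.array([net.predict(peptides) for net in networks])
    return scores.mean(axis=0)  # the 'ens' prediction
```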
How many folds?
• Cross validation is always good, but how many folds?
– Few folds -> small training data sets
– Many folds -> small test data sets
• Example from Tuesday's exercise: 560 peptides for training
– 50-fold: 10 peptides per test set (too few data to stop training)
– 2-fold: 280 peptides per test set (too few data to train on)
– 5-fold: 110 peptides per test set, 450 per training set
Problems with 5-fold cross validation

• Use the test set to stop training, and test set performance to evaluate training
– Over-fitting?
• If the test set is small: yes
• If the test set is large: no
• Confirm using "true" 5-fold cross-validation
– 1/5 for evaluation
– 4/5 for 4-fold cross-validation
Conventional 5 fold cross validation
“True” 5 fold cross validation
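A structural sketch contrasting the two protocols (`train_network` and `evaluate` are hypothetical helpers, shown only to make the data flow explicit):

```python
def conventional_cv(folds, train_network, evaluate):
    """Each fold's test set both stops training and scores the model."""
    scores = []
    for f, test in enumerate(folds):
        train = [x for g, fold in enumerate(folds) if g != f for x in fold]
        net = train_network(train, stop_set=test)  # test set stops training...
        scores.append(evaluate([net], test))       # ...and is evaluated on as well
    return scores

def true_cv(folds, train_network, evaluate):
    """Hold out 1/5 for evaluation; run 4-fold cross-validation on the rest."""
    scores = []
    for f, eval_set in enumerate(folds):
        inner = [fold for g, fold in enumerate(folds) if g != f]
        nets = []
        for s, stop_set in enumerate(inner):       # inner 4-fold cross-validation
            train = [x for r, fold in enumerate(inner) if r != s for x in fold]
            nets.append(train_network(train, stop_set=stop_set))
        scores.append(evaluate(nets, eval_set))    # evaluation data never guided training
    return scores
```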
When to be careful
• When data is scarce, the difference in performance obtained using "conventional" versus "true" cross validation can be very large
• When data is abundant the difference is small, and "true" cross validation might even score higher than "conventional" cross validation due to the ensemble aspect of the "true" cross validation approach
Do hidden neurons matter?
• The environment matters
NetMHCpan
Context matters
| Peptide | MHC environment (pseudo-sequence) | Prediction | Allele |
|---------|-----------------------------------|------------|--------|
| FMIDWILDA | YFAMYGEKVAHTHVDTLYVRYHYYTWAVLAYTWY | 0.89 | A0201 |
| FMIDWILDA | YFAMYQENMAHTDANTLYIIYRDYTWVARVYRGY | 0.08 | A0101 |
| DSDGSFFLY | YFAMYGEKVAHTHVDTLYVRYHYYTWAVLAYTWY | 0.08 | A0201 |
| DSDGSFFLY | YFAMYQENMAHTDANTLYIIYRDYTWVARVYRGY | 0.85 | A0101 |
Example
Summary
• Gradient descent is used to determine the updates for the synapses (weights) in the neural network
• Some relatively simple math defines the gradients
– Networks without hidden layers can be solved on the back of an envelope (SMM exercise)
– Hidden layers are a bit more complex, but still ok
• Always train networks using a test set to stop training
– Be careful when reporting predictive performance
• Use "true" cross-validation for small data sets
• And hidden neurons do matter (sometimes)
And some more stuff for the long cold winter nights
• Can it be made differently?

$$E = \frac{1}{\alpha} \cdot (O - t)^{\alpha}$$
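For this generalized error, only the error factor of the output delta changes; $\alpha = 2$ recovers the usual squared error. A small sketch (my illustration; it assumes an even $\alpha$ or $O \ge t$ so the power is well defined):

```python
def error_and_derivative(O, t, alpha=2.0):
    """E = (1/alpha) * (O - t)**alpha and dE/dO = (O - t)**(alpha - 1).

    alpha = 2 gives the standard squared error with dE/dO = (O - t)."""
    E = (O - t) ** alpha / alpha
    dE_dO = (O - t) ** (alpha - 1)
    return E, dE_dO

print(error_and_derivative(0.8, 0.0, alpha=2.0))  # ~ (0.32, 0.8)
print(error_and_derivative(0.8, 0.0, alpha=4.0))  # ~ (0.1024, 0.512)
```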
Predicting accuracy
• Can it be made differently?
$$E = \frac{1}{2} \cdot (O_1 - t)^2 \cdot O_2 + \lambda \cdot (1 - O_2)$$

The second output $O_2$ weighs the error of $O_1$ and acts as a reliability estimate: gradient descent pushes $O_2$ toward 0 when $\frac{1}{2}(O_1 - t)^2 > \lambda$ and toward 1 otherwise, so $O_2$ learns to signal how reliable the prediction is.
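A small sketch of this two-output error and its gradients (my illustration of the formula above; λ = 0.1 is an assumed value):

```python
def reliability_error(O1, O2, t, lam=0.1):
    """E = 1/2 * (O1 - t)**2 * O2 + lam * (1 - O2).

    dE/dO2 = 1/2 * (O1 - t)**2 - lam, so gradient descent pushes O2
    toward 0 when the squared error exceeds lam and toward 1 otherwise:
    O2 learns to signal how reliable the prediction O1 is."""
    sq_err = 0.5 * (O1 - t) ** 2
    E = sq_err * O2 + lam * (1.0 - O2)
    dE_dO1 = (O1 - t) * O2
    dE_dO2 = sq_err - lam
    return E, dE_dO1, dE_dO2

print(reliability_error(O1=0.9, O2=0.5, t=0.0))  # large error -> dE/dO2 > 0 (push O2 down)
print(reliability_error(O1=0.1, O2=0.5, t=0.0))  # small error -> dE/dO2 < 0 (push O2 up)
```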