

Introduction to Machine Learning
Neural Networks

Bhaskar Mukhoty, Shivam Bansal

Indian Institute of Technology Kanpur
Summer School 2019

June 4, 2019


Lecture Outline

Neural Networks

Backpropagation Algorithm

Convolution NN

Recurrent NN


Recap

Linear models: Learn a linear hypothesis function h in the input/attribute space X.

Kernelized models: Map inputs φ(x) from attribute space X to feature space F and learn a linear hypothesis function h in the feature space.


Neural Networks

A neural network consists of an input layer, an output layer, and one or more hidden layers.

Each node in a hidden layer computes a nonlinear transform of the inputs it receives.


Neural network with single hidden layer

Each input xn is transformed into several "pre-activations" using linear models,

a_nk = w_k^T x_n = Σ_{d=1}^{D} w_dk x_nd

A non-linear activation is applied to each pre-activation,

h_nk = g(a_nk)


Neural network with single hidden layer

A linear model is applied on the new features h_n,

s_n = v^T h_n = Σ_{k=1}^{K} v_k h_nk

Finally, the output is produced as y_n = o(s_n).

The overall effect is a non-linear mapping from inputs to outputs.
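The two-slide computation above can be sketched in a few lines of NumPy (a minimal illustration; the array shapes and function names are our own):

```python
import numpy as np

def forward(x, W, v, g=np.tanh, o=lambda s: s):
    """Single-hidden-layer network: x -> pre-activations -> h -> output.

    x: (D,) input, W: (D, K) with columns w_k, v: (K,) output weights,
    g: hidden activation, o: output activation (identity for regression).
    """
    a = W.T @ x        # a_k = w_k^T x   (pre-activations)
    h = g(a)           # h_k = g(a_k)    (hidden features)
    s = v @ h          # linear model on the learned features
    return o(s)        # y = o(s)

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # D = 3 input attributes
W = rng.normal(size=(3, 2))       # K = 2 hidden units
v = rng.normal(size=2)
y = forward(x, W, v)
```

With g set to the identity, the whole network collapses to a single linear model, which is why the nonlinearity is essential.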


Neural Network


Fully-connected Feedforward Neural Network

Fully-connected: All pairs of nodes between adjacent layers are connected to each other.

Feedforward: No backward connections. Also, only adjacent-layer nodes are connected.


Neural networks are feature learners

A NN tries to learn features that can predict the output well.


Neural Networks as Feature Learners

Figure: [Zeiler and Fergus, 2014]


Learning Neural Networks via Backpropagation

Backpropagation is gradient descent using the chain rule of derivatives.

Chain rule of derivatives: for example, if y = f1(x) and x = f2(z), then ∂y/∂z = (∂y/∂x)(∂x/∂z).


Learning Neural Networks via Backpropagation

Backpropagation iterates between a forward pass and a backward pass.

Forward pass computes the errors using the current parameters.

Backward pass computes the gradients and updates the parameters, starting from the parameters at the top layer and then moving backwards.

Using backpropagation in neural nets enables us to reuse previous computations efficiently.
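For the single-hidden-layer network from the earlier slides, one forward/backward iteration might look as follows (a sketch under our own choices: squared-error loss, tanh hidden units, linear output):

```python
import numpy as np

def backprop_step(x, y, W, v, lr=0.05):
    """One gradient-descent step via backpropagation on L = (yhat - y)^2 / 2."""
    # Forward pass: compute activations and the error with current parameters.
    a = W.T @ x                 # pre-activations
    h = np.tanh(a)              # hidden features
    yhat = v @ h                # network output
    err = yhat - y              # dL/dyhat
    # Backward pass: chain rule, starting from the top layer.
    grad_v = err * h            # dL/dv   (reuses h from the forward pass)
    dh = err * v                # dL/dh
    da = dh * (1.0 - h**2)      # tanh'(a) = 1 - tanh(a)^2
    grad_W = np.outer(x, da)    # dL/dW
    return W - lr * grad_W, v - lr * grad_v

rng = np.random.default_rng(1)
x, y = rng.normal(size=3), 1.0
W, v = rng.normal(size=(3, 4)), rng.normal(size=4)
for _ in range(1000):           # fit a single training point
    W, v = backprop_step(x, y, W, v)
```

Note how h, computed once in the forward pass, is reused by both gradients: this reuse is what makes backpropagation efficient.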


Activation Functions

Sigmoid: h = σ(a) = 1 / (1 + exp(−a))

tanh (hyperbolic tangent): h = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = 2σ(2a) − 1

ReLU (Rectified Linear Unit): h = max(0, a)

Leaky ReLU: h = max(βa, a), where β is a small positive number
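The four activations above, written directly in NumPy:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # Equivalent to np.tanh, via the identity on the slide.
    return 2.0 * sigmoid(2.0 * a) - 1.0

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, beta=0.01):
    # For a < 0 the output is beta * a instead of 0.
    return np.maximum(beta * a, a)
```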


Activation Functions

Sigmoid and tanh can have issues with saturating gradients.

If the weights are too large, the gradient for the weights is close to zero and learning becomes slow or may stop.

Pic credit: Andrej Karpathy

Activation Functions

The ReLU activation function can suffer from the "dead ReLU" problem.

If the weights are initialized such that the output of a node is 0, the gradient for its weights is zero and the node never fires.

Pic credit: Andrej Karpathy

Preventing overfitting in Neural Networks

Weight decay: l1 or l2 regularization on the weights.

Early stopping: Stop when validation error starts increasing.

Dropout: Randomly remove units (with some probability p ∈ (0, 1)) during training.
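Dropout is the easiest of these to get subtly wrong. Below is a sketch of the common "inverted dropout" variant; the rescaling by 1/(1−p) is an implementation choice, not something the slide fixes:

```python
import numpy as np

def dropout(h, p, rng, train=True):
    """Zero each unit with probability p during training; rescale survivors
    by 1/(1-p) so the expected activation matches test time."""
    if not train:
        return h                       # no-op at test time
    keep = rng.random(h.shape) >= p    # Bernoulli keep-mask
    return h * keep / (1.0 - p)
```

Because the survivors are rescaled during training, no correction is needed at test time.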


Convolutional Neural Network

CNNs are feedforward neural networks.

Weights are shared among the connections.

The set of distinct weights defines a filter or local feature detector.


Convolution

An operation that captures spatially local patterns in the input.

Usually several filters {W^k}_{k=1}^K are applied, each producing a separate feature map.

These filters are learned using backpropagation.
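A direct (slow but explicit) implementation of the operation, for a single 2-D input and one filter; like most deep-learning libraries, it actually computes cross-correlation (no filter flip):

```python
import numpy as np

def conv2d(x, w):
    """'Valid' convolution: dot product of the filter with every local patch."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

# A vertical-gradient image and a difference filter: the filter acts as a
# local feature detector for intensity change along the rows.
x = np.outer(np.arange(5.0), np.ones(5))
edges = conv2d(x, np.array([[-1.0], [1.0]]))
```

Applying K such filters to the same input gives the K feature maps mentioned above.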


Pooling

An operation that reduces the dimension of the input.

The pooling operation is fixed beforehand, not learned.

Popular pooling approaches: Max-pooling, average pooling
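Both popular variants can be written as a single reshape (a sketch for non-overlapping windows on one 2-D feature map):

```python
import numpy as np

def pool2d(x, size=2, op=np.max):
    """Apply `op` (np.max or np.mean) over each non-overlapping size x size block."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]   # drop ragged border rows/columns
    blocks = x.reshape(H // size, size, W // size, size)
    return op(blocks, axis=(1, 3))
```

A 2x2 pooling halves each spatial dimension, which is how pooling reduces the input dimension.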


Convolutional Neural Network


Modeling sequential data

Examples of sequential data: videos, text, speech.

FFNN on a single observation x_n

FFNN on sequential data x_1, ..., x_T

For sequential data, we want dependencies between the h_t's of different observations.


Recurrent Neural Networks

A neural network for sequential data.

Each hidden state h_t = f(W x_t + U h_{t−1}), where U is a K × K matrix and f is some activation function.
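The recurrence is one line of NumPy applied repeatedly; note that the same W and U are shared across all time steps (the shapes here are our own example):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, f=np.tanh):
    """One recurrence: h_t = f(W x_t + U h_{t-1})."""
    return f(W @ x_t + U @ h_prev)

def rnn_forward(xs, W, U, h0):
    """Run the recurrence over a sequence, collecting every hidden state."""
    h, states = h0, []
    for x_t in xs:                 # same parameters reused at every step
        h = rnn_step(x_t, h, W, U)
        states.append(h)
    return states

rng = np.random.default_rng(2)
W, U = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))   # K = 4, inputs in R^3
xs = [rng.normal(size=3) for _ in range(5)]               # sequence of length 5
states = rnn_forward(xs, W, U, h0=np.zeros(4))
```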


Different types of RNN

Both input and output can be sequences of different lengths.


Backpropagation through time

Think of the time dimension as another hidden layer; then it is just like standard backpropagation for feedforward neural nets.


RNN Limitation

Vanishing or exploding gradients: Repeated multiplication can cause gradients to vanish or explode.

Weak long-term dependency: Repeated composition of functions causes the sensitivity of the hidden states to a given part of the input to become weaker as we move along the sequence.
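A tiny numerical illustration (our own construction) of the first point: backpropagation through T time steps multiplies the gradient by roughly a power of U, so a U with spectral radius below 1 shrinks it geometrically, while one above 1 blows it up.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 8
U = 0.3 * rng.normal(size=(K, K)) / np.sqrt(K)   # spectral radius well below 1
g = rng.normal(size=K)                           # gradient arriving at step T
for _ in range(50):                              # 50 steps of BPTT
    g = U.T @ g                                  # repeated multiplication
# The norm of g is now vanishingly small; scaling U up instead makes it explode.
```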


Long Short-Term Memory

An RNN with hidden nodes having gates to remember or forget information.

An open gate is denoted by 'o' and a closed gate by '-'.

Minor variations of the LSTM exist depending on the gates used, e.g., the GRU.


Gated Recurrent Unit (Simplified)

An RNN computes hidden states as

h_t = tanh(W x_t + U h_{t−1})

For the RNN, the state update is multiplicative (weak memory and gradient issues).

A GRU computes hidden states as

h̃_t = tanh(W x_t + U h_{t−1})

Γ_u = σ(P x_t + Q h_{t−1})

h_t = Γ_u × h̃_t + (1 − Γ_u) × h_{t−1}

For the GRU, the state update is additive.
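The three GRU equations above translate directly to code, with the gate applied elementwise (the parameter shapes are our own example):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, P, Q):
    """Simplified GRU update with a single (update) gate, as on the slide."""
    h_tilde = np.tanh(W @ x_t + U @ h_prev)   # candidate new state
    gamma_u = sigmoid(P @ x_t + Q @ h_prev)   # update gate, elementwise in (0, 1)
    return gamma_u * h_tilde + (1.0 - gamma_u) * h_prev

rng = np.random.default_rng(4)
K, D = 4, 3
W, U = rng.normal(size=(K, D)), rng.normal(size=(K, K))
P, Q = rng.normal(size=(K, D)), rng.normal(size=(K, K))
x_t = rng.normal(size=D)
h = gru_step(x_t, np.zeros(K), W, U, P, Q)
```

When the gate saturates near 0, h_t ≈ h_{t−1} and information is carried forward unchanged; this additive shortcut is what eases the memory and gradient issues of the plain RNN.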


Questions?


References I

Andrew Ng (2019). Sequence models. https://www.coursera.org/learn/nlp-sequence-models.

Carter, S. (2019). Visualize feed-forward neural network. https://playground.tensorflow.org/.

Kar, P. (2017). Introduction to machine learning. https://web.cse.iitk.ac.in/users/purushot/courses/ml/2017-18-a.


References II

Rai, P. (2018). Introduction to machine learning. https://www.cse.iitk.ac.in/users/piyush/courses/ml_autumn18/index.html.

Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge University Press.

Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer.
