
Introduction to Machine Learning: Neural Networks

Bhaskar Mukhoty, Shivam Bansal

Indian Institute of Technology Kanpur, Summer School 2019

June 4, 2019


Lecture Outline

Neural Networks

Backpropagation Algorithm

Convolutional NN

Recurrent NN


Recap

Linear models: Learn a linear hypothesis function h in the input/attribute space X.

Kernelized models: Map inputs x from the attribute space X to a feature space F via φ(x), and learn a linear hypothesis function h in the feature space.


Neural Networks

A neural network consists of an input layer, an output layer and one or more hidden layers.

Each node in a hidden layer computes a nonlinear transform of the inputs it receives.


Neural network with single hidden layer

Each input $x_n$ is transformed into several "pre-activations" using linear models,

$a_{nk} = w_k^\top x_n = \sum_{d=1}^{D} w_{dk}\, x_{nd}$

A non-linear activation is applied on each pre-activation,

$h_{nk} = g(a_{nk})$


Neural network with single hidden layer

A linear model is applied on the new features $h_n$,

$s_n = v^\top h_n = \sum_{k=1}^{K} v_k h_{nk}$

Finally, the output is produced as $y_n = o(s_n)$.

The overall effect is a non-linear mapping from inputs to outputs.
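To make this concrete, here is a minimal NumPy sketch of the forward pass for a single-hidden-layer network (not the lecture's code); the shapes, the tanh hidden activation and the identity output o are illustrative assumptions.

import numpy as np

def forward(x, W, v, g=np.tanh, o=lambda s: s):
    """Forward pass of a single-hidden-layer network.

    x : (D,) input vector
    W : (D, K) input-to-hidden weights (column k is w_k)
    v : (K,) hidden-to-output weights
    g : hidden activation, o : output activation
    """
    a = W.T @ x          # pre-activations a_k = w_k^T x
    h = g(a)             # hidden features h_k = g(a_k)
    s = v @ h            # linear model on the new features
    return o(s)          # output y = o(s)

# Example: D = 4 inputs, K = 3 hidden units
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(4, 3))
v = rng.normal(size=3)
print(forward(x, W, v))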


Neural Network


Fully-connected Feedforward Neural Network

Fully-connected: All pairs of nodes between adjacent layers are connected to each other.

Feedforward: No backward connections. Also, only nodes in adjacent layers are connected.


Neural networks are feature learners

A neural network tries to learn features that can predict the output well.


Neural Networks as Feature Learners

Figure: [Zeiler and Fergus, 2014]


Learning Neural Networks via Backpropagation

Backpropagation is gradient descent using the chain rule of derivatives.

Chain rule of derivatives: For example, if $y = f_1(x)$ and $x = f_2(z)$, then $\frac{\partial y}{\partial z} = \frac{\partial y}{\partial x} \frac{\partial x}{\partial z}$.


Learning Neural Networks via Backpropagation

Backpropagation iterates between a forward pass and a backward pass.

Forward pass computes the errors using the current parameters.

Backward pass computes the gradients and updates the parameters, starting from the parameters at the top layer and then moving backwards.

Using backpropagation in neural nets enables us to reuse previous computations efficiently.
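As an illustration (a sketch, not the lecture's algorithm), the following NumPy snippet performs one backpropagation update for the single-hidden-layer network above, assuming a squared-error loss, a tanh hidden activation and an identity output; note how the backward pass reuses the quantities cached during the forward pass.

import numpy as np

def backprop_step(x, y, W, v, lr=0.1):
    """One gradient-descent step via backpropagation (squared-error loss)."""
    # Forward pass: compute and cache intermediate quantities
    a = W.T @ x              # pre-activations
    h = np.tanh(a)           # hidden features
    s = v @ h                # output (identity output activation)
    err = s - y              # dL/ds for L = 0.5 * (s - y)^2

    # Backward pass: chain rule, starting from the top layer
    grad_v = err * h                 # dL/dv
    dh = err * v                     # dL/dh
    da = dh * (1.0 - h ** 2)         # dL/da, using tanh'(a) = 1 - tanh(a)^2
    grad_W = np.outer(x, da)         # dL/dW

    # Parameter update
    v -= lr * grad_v
    W -= lr * grad_W
    return W, v

# Toy usage
rng = np.random.default_rng(0)
W, v = rng.normal(size=(4, 3)), rng.normal(size=3)
x, y = rng.normal(size=4), 1.0
W, v = backprop_step(x, y, W, v)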


Activation Functions

Sigmoid: $h = \sigma(a) = \frac{1}{1+\exp(-a)}$

tanh (hyperbolic tangent): $h = \frac{\exp(a)-\exp(-a)}{\exp(a)+\exp(-a)} = 2\sigma(2a) - 1$

ReLU (Rectified Linear Unit): $h = \max(0, a)$

Leaky ReLU: $h = \max(\beta a, a)$, where $\beta$ is a small positive number
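For reference, the four activations above as NumPy one-liners (a sketch; the default β = 0.01 for Leaky ReLU is an illustrative choice):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)            # equals 2 * sigmoid(2a) - 1

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, beta=0.01):    # beta: small positive slope for a < 0
    return np.maximum(beta * a, a)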


Activation Functions

Sigmoid and tanh can have issues with saturating gradients.

If the weights are too large, the gradient for the weights is close to zero and learning becomes slow or may stop.

Pic credit: Andrej Karpathy

Activation Functions

The ReLU activation function can suffer from the dead ReLU problem.

If the weights are initialized such that the output of a node is 0, the gradient for its weights is zero and the node never fires.

Pic credit: Andrej Karpathy

Preventing overfitting in Neural Networks

Weight decay: $\ell_1$ or $\ell_2$ regularization on the weights.

Early stopping: Stop when validation error starts increasing.

Dropout: Randomly remove units (with some probability $p \in (0, 1)$) during training.
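A minimal sketch of dropout on a layer's activations, assuming the common "inverted dropout" variant in which surviving units are rescaled during training so that no change is needed at test time:

import numpy as np

def dropout(h, p=0.5, training=True, rng=None):
    """Inverted dropout: drop each unit with probability p during training."""
    if not training or p == 0.0:
        return h
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)   # rescale so the expected activation is unchanged

# Example: roughly half the units are zeroed, the rest are doubled
h = np.ones(10)
print(dropout(h, p=0.5))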


Convolutional Neural Network

CNNs are feedforward neural networks.

Weights are shared among the connections.

The set of distinct weights defines a filter or local feature detector.


Convolution

An operation that captures spatially local patterns in the input.

Usually several filters $\{W^k\}_{k=1}^{K}$ are applied, each producing a separate feature map.

These filters are learned using backpropagation.
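A minimal sketch of applying one such filter with stride 1 and no padding, producing a single feature map; a real CNN layer would apply K learned filters, typically followed by a non-linearity:

import numpy as np

def conv2d(x, w):
    """Convolve (cross-correlate) a 2D input x with a single filter w, stride 1, no padding."""
    H, W_in = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W_in - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w)  # local dot product
    return out

# Example: a 3x3 vertical-edge-like filter on a 5x5 input
x = np.arange(25, dtype=float).reshape(5, 5)
w = np.array([[1., 0., -1.]] * 3)
print(conv2d(x, w).shape)   # (3, 3)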


Pooling

An operation that reduces the dimension of the input.

The pooling operation is fixed beforehand and not learned.

Popular pooling approaches: Max-pooling, average pooling
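A sketch of non-overlapping 2x2 max-pooling (any trailing rows/columns that do not fill a window are dropped); average pooling would simply replace max with mean:

import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max-pooling over size x size windows."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]   # trim so the dimensions divide evenly
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))

# Example: a 4x4 input becomes a 2x2 output
x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))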


Convolutional Neural Network


Modeling sequential data

Examples of sequential data: videos, text, speech.

FFNN on a single observation $x_n$.

FFNN on sequential data $x_1, \ldots, x_T$.

For sequential data, we want dependencies between the $h_t$'s of different observations.


Recurrent Neural Networks

A neural network for sequential data.

Each hidden state $h_t = f(W x_t + U h_{t-1})$, where $U$ is a $K \times K$ matrix and $f$ is some activation function.
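A minimal sketch of unrolling this recurrence over a sequence; the tanh activation and the randomly initialized W (K x D) and U (K x K) are illustrative assumptions:

import numpy as np

def rnn_forward(xs, W, U, h0=None):
    """Run a vanilla RNN over a sequence xs of input vectors."""
    K = U.shape[0]
    h = np.zeros(K) if h0 is None else h0
    states = []
    for x_t in xs:                       # one step per time index t
        h = np.tanh(W @ x_t + U @ h)     # h_t = f(W x_t + U h_{t-1})
        states.append(h)
    return states

# Example: a sequence of T = 5 inputs of dimension 3, K = 4 hidden units
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]
W, U = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
hs = rnn_forward(xs, W, U)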


Different types of RNN

Both input and output can be sequences of different lengths.


Backpropagation through time

Think of the time dimension as another hidden layer, and then it is just like standard backpropagation for feedforward neural nets.


RNN Limitation

Vanishing or exploding gradients: Repeated multiplication can cause gradients to vanish or explode.

Weak long-term dependency: Repeated composition of functions causes the sensitivity of the hidden states to a given part of the input to become weaker as we move along the sequence.


Long Short-Term Memory

An RNN with hidden nodes having gates to remember or forget information.

An open gate is denoted by 'o' and a closed gate by '-'.

Minor variations of the LSTM exist depending on the gates used, e.g., the GRU.


Gated Recurrent Unit (Simplified)

An RNN computes hidden states as

$h_t = \tanh(W x_t + U h_{t-1})$

For the RNN, the state update is multiplicative (weak memory and gradient issues).

A GRU computes hidden states as

$\tilde{h}_t = \tanh(W x_t + U h_{t-1})$

$\Gamma_u = \sigma(P x_t + Q h_{t-1})$

$h_t = \Gamma_u \times \tilde{h}_t + (1 - \Gamma_u) \times h_{t-1}$

For the GRU, the state update is additive.
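A sketch of the simplified GRU update above, with illustrative parameter matrices W, U (for the candidate state) and P, Q (for the update gate Γ_u); the gating is element-wise:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, P, Q):
    """One step of the simplified GRU."""
    h_tilde = np.tanh(W @ x_t + U @ h_prev)     # candidate state
    gamma_u = sigmoid(P @ x_t + Q @ h_prev)     # update gate, values in (0, 1)
    return gamma_u * h_tilde + (1.0 - gamma_u) * h_prev   # additive (gated) update

# Example: input dimension 3, hidden dimension 4
rng = np.random.default_rng(0)
W, U = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
P, Q = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
h = gru_step(rng.normal(size=3), np.zeros(4), W, U, P, Q)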


Questions?


References I

Ng, A. (2019). Sequence models. https://www.coursera.org/learn/nlp-sequence-models.

Carter, S. (2019). Visualize feed-forward neural network. https://playground.tensorflow.org/.

Kar, P. (2017). Introduction to machine learning. https://web.cse.iitk.ac.in/users/purushot/courses/ml/2017-18-a.


References II

Rai, P. (2018). Introduction to machine learning. https://www.cse.iitk.ac.in/users/piyush/courses/ml_autumn18/index.html.

Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge University Press.

Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer.

