
Page 1: ICS Summer School 2016 Data Science - Week 4 Gabriella

ICS Summer School 2016Data Science - Week 4

Gabriella Contardo

LIP6, University Pierre et Marie Curie, Paris, France

August 10, 2016


Page 2: ICS Summer School 2016 Data Science - Week 4 Gabriella

Outline of the week

Course 1: Reminders of the learning paradigm, neural networks / multi-layer perceptron.
Course 2: Deep learning: Convolutional Neural Networks.
Course 3: Tips on deep learning; PCA, Matrix Factorization and Recommender Systems.
Course 4: Unsupervised learning: Clustering (K-Means), EM.
Course 5: Unsupervised learning with (deep) neural networks: Auto-encoders, RNN. Word embeddings.


Page 3: ICS Summer School 2016 Data Science - Week 4 Gabriella

References

On-line course material for today. Thanks to:

Patrick Gallinari - Professor at UPMC - Course "Apprentissage Statistique"
Nicolas Baskiotis - Assistant Professor at UPMC - Course "ARF" (Master DAC)
Fei-Fei Li's course at Stanford: CS231n: Convolutional Neural Networks for Visual Recognition (lecture: introduction to neural nets, backpropagation)
Course by Y. Bengio (MLSS 2014 slides and video available online)

Also interesting (generally speaking):
Machine Learning course by Andrew Ng on Coursera
Lectures by Nando de Freitas (Oxford) - videos available


Page 4: ICS Summer School 2016 Data Science - Week 4 Gabriella

Outline of the day

Deep Learning: Tips on Deep Learning

Unsupervised Learning - Data Compression / Representation:

PCA
MF/NMF and recommender systems


Page 5: ICS Summer School 2016 Data Science - Week 4 Gabriella

GoogLeNet

Erratum: it is Google's paper, not by LeCun (he is at Facebook), but it is named as a reference/joke to his (old, first) model "LeNet".
"Yellow layers": additional supervised layers added during training "on the side" parts of the network. "Later control experiments have shown that the effect of the auxiliary networks is relatively minor (around 0.5%) and that it required only one of them to achieve the same effect." (cf. the paper)
Learning time: "a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage".
More info in the paper: http://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf


Page 6: ICS Summer School 2016 Data Science - Week 4 Gabriella

Tips for learning deep (convolutional) networks

Data Augmentation

Modify input (pixels) without changing label

Train on transformed data

(credit Fei-Fei Li's course CS231n, Stanford)

Page 7: ICS Summer School 2016 Data Science - Week 4 Gabriella

Tips for learning deep (convolutional) networks

Data Augmentation

Flip

Random crops/scales

(+sample on crop)

Color jitter (e.g. contrast)

Translation, rotation, ...

(credit Fei-Fei Li's course CS231n, Stanford)
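As a concrete illustration of these label-preserving transforms, here is a minimal numpy sketch of a horizontal flip and a random crop; the image size (32×32) and crop size (24×24) are illustrative assumptions, not values from the course.

```python
import numpy as np

def augment(img, crop_size=24, rng=np.random):
    """Randomly flip and crop an image of shape (H, W, C); the label is unchanged."""
    if rng.rand() < 0.5:                 # horizontal flip with probability 0.5
        img = img[:, ::-1, :]
    h, w, _ = img.shape                  # random crop of size crop_size x crop_size
    top = rng.randint(0, h - crop_size + 1)
    left = rng.randint(0, w - crop_size + 1)
    return img[top:top + crop_size, left:left + crop_size, :]

fake_img = np.random.rand(32, 32, 3)     # stand-in for a 32x32 RGB training image
crop = augment(fake_img)                 # shape (24, 24, 3), same label as fake_img
```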

Page 8: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data Augmentation

Flip

Random crops/scales

(+sample on crop)

Color jitter (e.g. contrast)

Translation, rotation, ...

(credit Fei-Fei Li's course CS231n, Stanford)

Straightforward for images, but what about other types of data?

Tips for learning deep (convolutional) networks

Page 9: ICS Summer School 2016 Data Science - Week 4 Gabriella

Generally :

Training : add random noise

Testing : marginalize over the noise

(credit Fei-Fei Li's course CS231n, Stanford)

Tips for learning deep (convolutional) networks

Page 10: ICS Summer School 2016 Data Science - Week 4 Gabriella

« You need a lot of data if you want to train/use CNN »

→ Nope ! (well, not necessarily…) « Transfer » learning

(credit Fei-Fei Li's course CS231n, Stanford)

Tips for learning deep (convolutional) networks

Page 11: ICS Summer School 2016 Data Science - Week 4 Gabriella

(credit Fei-Fei Li's course CS231n, Stanford)

Tips for learning deep (convolutional) networks

« You need a lot of data if you want to train/use CNN »

→ Nope ! (well, not necessarily…)

Page 12: ICS Summer School 2016 Data Science - Week 4 Gabriella

(credit Fei-Fei Li's course CS231n, Stanford)

Tips for learning deep (convolutional) networks

Using pre-trained CNNs is the norm.
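As an illustration of this tip, here is a hedged PyTorch/torchvision sketch of reusing a pre-trained CNN as a fixed feature extractor and retraining only its last layer; the choice of ResNet-18 and the number of classes are assumptions for the example, not part of the course.

```python
import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10                        # assumed label set of the new (small) dataset

# Load a CNN pre-trained on ImageNet and freeze its convolutional layers
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the last fully-connected layer: only this part is trained on the new data
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the parameters of the new layer are given to the optimizer
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```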

Page 13: ICS Summer School 2016 Data Science - Week 4 Gabriella

Dimensionality Reduction

Problem
More dimensions ⇒ more expressivity.
Too many dimensions: low variance on a dimension, noise, takes memory space and time...
For tasks on images, text, and others: too many dimensions → many dimensions that are not very informative, highly correlated.
⇒ Lower the dimension, but without losing information.
Goal: find a projection Φ : R^d → R^{d'} with d' << d

Applications:
Learning
Visualization
Noise reduction

credit N.Baskiotis UPMC
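For instance, such a projection Φ can be obtained in a couple of lines with scikit-learn (a sketch; the digits dataset is only an example choice):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)          # X has shape (1797, 64): 64 pixel dimensions
Z = PCA(n_components=2).fit_transform(X)     # Z has shape (1797, 2): ready for a 2D scatter plot
```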


Page 14: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data Compression

(credit Andrew Ng coursera)

Reduce data from 2D to 1D

Page 15: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data Compression

(credit Andrew Ng coursera)

Reduce data from 2D to 1D

Page 16: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data Compression

(credit Andrew Ng coursera)

Reduce data from 3D to 2D

Original dataset → Projected dataset

Visualization of the projected dataset in 2D: z = [z_1, z_2], with z^i = [z^i_1, z^i_2]

Page 17: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data Compression

(credit http://setosa.io/ev/principal-component-analysis/)

Motivation : Visualization

Page 18: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data Compression

(credit http://setosa.io/ev/principal-component-analysis/)

Motivation : Visualization

17D to 1D →

17D to 2D :

Page 19: ICS Summer School 2016 Data Science - Week 4 Gabriella

Dimensionality Reduction

Quiz! Suppose we apply dimensionality reduction to a dataset of m examples {x^1, ..., x^m} where x^i ∈ R^n. As a result, we will get out:

A lower-dimensional dataset {z^1, ..., z^k} of k examples where k ≤ n
A lower-dimensional dataset {z^1, ..., z^k} of k examples where k > n
A lower-dimensional dataset {z^1, ..., z^m} of m examples where z^i ∈ R^k for some value of k and k ≤ n
A lower-dimensional dataset {z^1, ..., z^m} of m examples where z^i ∈ R^k for some value of k and k > n

credit Andrew Ng - Stanford Univ


Page 20: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data Compression - PCA

Principal Component Analysis (PCA) : Problem formulation

⇒ Goal: find a line onto which to project the data

(credit Andrew Ng coursera)

Page 21: ICS Summer School 2016 Data Science - Week 4 Gabriella

Dimensionality Reduction - PCA

Formulation
Reduce from 2 dimensions to 1 dimension: find a direction (a vector u^1 ∈ R^n) onto which to project the data so as to minimize the projection error.
Generic case, reduce from n dimensions to k dimensions: find k vectors u^1, ..., u^k (directions) onto which to project the data so as to minimize the projection error.

credit Andrew Ng - Stanford Univ


Page 22: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data Compression - PCA

PCA is not linear regression

(credit Andrew Ng coursera)

Page 23: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data Compression - PCA

Quiz: Suppose you run PCA on the following dataset. Which of the following would be a reasonable vector u^1 onto which to project the data? (Usually ||u^1|| = 1)

(credit Andrew Ng coursera)

u^1 = [1, 0]

u^1 = [0, 1]

u^1 = [1/√2, 1/√2]

u^1 = [−1/√2, 1/√2]

Page 24: ICS Summer School 2016 Data Science - Week 4 Gabriella

Dimensionality Reduction - PCA algorithm

Data preprocessing

Training set: x^1, ..., x^m

Preprocessing: feature scaling / mean normalization:
Compute µ_j = (1/m) ∑_{i=1}^m x^i_j
Replace each x^i_j with x^i_j − µ_j : each feature then has zero mean.
If different features are on different scales, rescale them to a comparable range of values, e.g. x^i_j = (x^i_j − min(x_j)) / (max(x_j) − min(x_j)).

To do also in supervised learning! credit Andrew Ng - Stanford Univ


Page 25: ICS Summer School 2016 Data Science - Week 4 Gabriella

Dimensionality Reduction - PCA algorithm

Reduce data from n dimensions to k dimensions.
Compute the covariance matrix:

Σ = (1/m) ∑_{i=1}^m (x^i)(x^i)^T

(Python: numpy.cov(x) if x is your matrix of examples with examples in columns; Σ ∈ R^{n×n}.)
Compute the eigenvectors of Σ:

U, S, V = numpy.linalg.svd(Σ)   or   W, V = numpy.linalg.eig(Σ)

The matrix U (or V if using eig) is ∈ R^{n×n}.
Take the first k columns of U → U_reduce.
The new representation z^i of an example x^i is:

z^i = U_reduce^T x^i

credit Andrew Ng - Stanford Univ
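Putting these steps together, a minimal numpy sketch of the algorithm above (function and variable names are illustrative, not from the course):

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X (m examples, n features) onto the first k principal directions."""
    X = X - X.mean(axis=0)              # mean normalization (see the preprocessing slide)
    m = X.shape[0]
    Sigma = (X.T @ X) / m               # covariance matrix, n x n
    U, S, Vt = np.linalg.svd(Sigma)     # columns of U: eigenvectors / principal directions
    U_reduce = U[:, :k]                 # keep the first k columns
    Z = X @ U_reduce                    # row-wise version of z^i = U_reduce^T x^i
    return Z, U_reduce, S
```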


Page 26: ICS Summer School 2016 Data Science - Week 4 Gabriella

Dimensionality Reduction - PCA algorithm

Quiz

In PCA, we obtain z ∈ R^k from x ∈ R^n as follows:

z^i = U_reduce^T x^i

Which of the following is a correct expression for z^i_j?

z^i_j = (u^k)^T x^i
z^i_j = (u^j)^T x^i_j
z^i_j = (u^j)^T x^i_k
z^i_j = (u^j)^T x^i

credit Andrew Ng - Stanford Univ


Page 27: ICS Summer School 2016 Data Science - Week 4 Gabriella

Dimensionality Reduction - PCA algorithm

Choosing the number of components k
Average squared projection error: error_projection = (1/m) ∑_{i=1}^m ||x^i − x^i_approx||²
where x^i_approx = U_reduce z^i is the projection of x^i.
Variation in the data: (1/m) ∑_{i=1}^m ||x^i||², the average squared distance of the examples to the origin.
Choose the smallest value of k such that the ratio between the average squared projection error and the variation is small:

error_projection / variation ≤ 0.01

"99% of variance retained"

credit Andrew Ng - Stanford Univ


Page 28: ICS Summer School 2016 Data Science - Week 4 Gabriella

Dimensionality Reduction - PCA algorithm

Algorithm to find k

Compute Σ, U.
Try PCA with k = 1:
Compute U_reduce, z^1, ..., z^m, x^1_approx, ..., x^m_approx
Check whether the ratio is ≤ 0.01
Increment k if the criterion is not met, stop otherwise.

Speed-up: using U, S, V = svd(Σ), S ∈ R^{n×n} is a diagonal matrix with entries S_ii.
For a given value of k, the ratio can be computed as:

1 − (∑_{i=1}^k S_ii) / (∑_{i=1}^n S_ii)

⇒ No need to recompute all the z and x_approx.

credit Andrew Ng - Stanford Univ
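A short sketch of this shortcut, reusing the singular values S computed above (the helper name choose_k is an assumption of the sketch):

```python
import numpy as np

def choose_k(S, threshold=0.01):
    """Smallest k with (average projection error) / (variation) <= threshold."""
    retained = np.cumsum(S) / S.sum()      # variance retained for k = 1, 2, ..., n
    for k, r in enumerate(retained, start=1):
        if 1.0 - r <= threshold:
            return k                       # "99% of variance retained" for threshold=0.01
    return len(S)                          # fall back to keeping all dimensions

# k = choose_k(S)   # with S from numpy.linalg.svd(Sigma) as above
```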


Page 29: ICS Summer School 2016 Data Science - Week 4 Gabriella

Dimensionality Reduction - PCA algorithm

Quiz

We said that PCA chooses k directions u^1, ..., u^k onto which to project the data so as to minimize the squared projection error. Another way to say the same thing is that PCA tries to minimize:

(1/m) ∑_{i=1}^m ||x^i||²
(1/m) ∑_{i=1}^m ||x^i_approx||²
(1/m) ∑_{i=1}^m ||x^i − x^i_approx||²
(1/m) ∑_{i=1}^m ||x^i + x^i_approx||²

credit Andrew Ng - Stanford Univ


Page 30: ICS Summer School 2016 Data Science - Week 4 Gabriella

Dimensionality Reduction - PCA algorithm

Tips on PCA

Dataset (x^1, y^1), ..., (x^m, y^m), with x^i ∈ R^10000
↪ (z^1, y^1), ..., (z^m, y^m) with z^i ∈ R^1000

Run PCA on the training dataset only (compute U_reduce, find k, etc.), then apply it to the train, validation and test sets.
Fewer dimensions → less memory, and subsequent models with fewer parameters (e.g. neural networks).
For visualization: k = 2 or 3.

Don't:
Don't use PCA to prevent overfitting; use regularization instead.
Don't run PCA before even testing on your raw data!

credit Andrew Ng - Stanford Univ


Page 31: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data Compression with PCA: orthogonal basis → no redundancy.

Images are made of little "basic" parts:

Can we learn a dictionary of such patches to represent data?

(credit N.Baskiotis)

Page 32: ICS Summer School 2016 Data Science - Week 4 Gabriella

Matrix Factorization

(credit P.Gallinari)

Idea
Project the data vectors into a latent space of dimension k < m, the size of the original space.
The axes of this latent space represent a new basis for data representation.
Each original data vector will be approximated as a linear combination of the k basis vectors of this new space.

Page 33: ICS Summer School 2016 Data Science - Week 4 Gabriella

Matrix Factorization

(credit P.Gallinari)

 

[Diagram: the factorization X ≈ U V]

Page 34: ICS Summer School 2016 Data Science - Week 4 Gabriella

Matrix Factorization

(credit P.Gallinari)

Applications:
Recommendation (User × Item matrix)
Matrix completion
Link prediction (adjacency matrix)
...

Page 35: ICS Summer School 2016 Data Science - Week 4 Gabriella

Matrix Factorization

(credit P.Gallinari)

[Diagram: original data X; dictionary / basis vectors U (columns u_.1, u_.2, u_.3); representation V (column v_.j encodes data point x_.j)]

Page 36: ICS Summer School 2016 Data Science - Week 4 Gabriella

Matrix Factorization

(credit P.Gallinari)

Interpretation
If X is a User × Item matrix:
Users and items are represented in a common representation space of size k.
Their interaction is measured by a dot product in this space.

[Diagram: original data X (Users × Items); user representation u_i. (row of U); item representation v_.j (column of V)]

Page 37: ICS Summer School 2016 Data Science - Week 4 Gabriella

Matrix Factorization

(credit P.Gallinari)

Interpretation
If X is the adjacency or weight matrix of a directed graph:

[Diagram: original data X (Nodes × Nodes); sender representation u_i. (row of U); receiver representation v_.j (column of V)]

Page 38: ICS Summer School 2016 Data Science - Week 4 Gabriella

Recommender Systems

Predicting movie ratings

(credit Andrew Ng coursera)

Page 39: ICS Summer School 2016 Data Science - Week 4 Gabriella

Recommender Systems

Collaborative Filtering

(credit Andrew Ng coursera)

Page 40: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data compression - Matrix Factorization

Collaborative filtering optimization objective

Minimize J with respect to the u^i and v^j, with i ∈ {1, ..., n_m} (items, e.g. movies) and j ∈ {1, ..., n_u} (users):

J(u^1, ..., u^{n_m}, v^1, ..., v^{n_u}) = (1/2) ∑_{(i,j): r(i,j)=1} (u^i · v^j − y^{ij})² + (λ/2) ∑_{i=1}^{n_m} ∑_{k=1}^{n} (u^i_k)² + (λ/2) ∑_{j=1}^{n_u} ∑_{k=1}^{n} (v^j_k)²


Page 41: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data compression - Matrix Factorization

Algorithm

Initialize u^1, ..., u^{n_m}, v^1, ..., v^{n_u} randomly.
Minimize J(u^1, ..., u^{n_m}, v^1, ..., v^{n_u}) using gradient descent, e.g.:

u^i_k ← u^i_k − ε ( ∑_{j: r(i,j)=1} (u^i · v^j − y^{ij}) v^j_k + λ u^i_k )

v^j_k ← v^j_k − ε ( ∑_{i: r(i,j)=1} (u^i · v^j − y^{ij}) u^i_k + λ v^j_k )

(Here the updates are batch: all the ratings for i or j are used.)
Predicting a rating for a user j with learned representation v^j and an item i with learned representation u^i: u^i · v^j

N.B.: it is also possible to use alternating gradient descent or other optimization schemes.
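A minimal numpy sketch of these batch updates on a toy rating matrix (the matrix sizes, the learning rate ε and λ below are illustrative assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)
n_items, n_users, k = 5, 4, 2
Y = rng.randint(1, 6, size=(n_items, n_users)).astype(float)   # toy ratings y_ij
R = rng.rand(n_items, n_users) > 0.3                           # mask: r(i,j)=1 if item i rated by user j

U = 0.1 * rng.randn(n_items, k)          # item representations u^i as rows
V = 0.1 * rng.randn(n_users, k)          # user representations v^j as rows
eps, lam = 0.01, 0.1

for _ in range(2000):
    E = (U @ V.T - Y) * R                # errors (u^i . v^j - y_ij) on observed ratings only
    grad_U = E @ V + lam * U             # batch gradient w.r.t. all the u^i
    grad_V = E.T @ U + lam * V           # batch gradient w.r.t. all the v^j
    U -= eps * grad_U
    V -= eps * grad_V

print(U[0] @ V[1])                       # predicted rating of item 0 by user 1: u^0 . v^1
```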


Page 42: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data compression - Matrix Factorization

Loss function - example

Minimize C = ||X − UV||² + c(U, V)

Constraints on U, V via c(U, V), e.g.:
Positivity (NMF - next slides)
Sparsity of the representations, e.g. ||V||_1
Over-complete dictionary U: k > n
Symmetry
Bias on U and V (e.g. a popularity bias for item recommendation)
Any a priori knowledge on U and V

credit P.Gallinari UPMC


Page 43: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data compression - Matrix Factorization

Non Negative Matrix Factorization

Minimize C = ||X − UV||² under the constraints U, V ≥ 0.
The loss is convex in U and in V separately, but not jointly in (U, V).
The problem can be solved via a Lagrangian formulation:

Iterative multiplicative algorithm
U, V initialized at random (non-negative) values
Iterate until convergence:

u_ij ← u_ij (X V^T)_ij / (U V V^T)_ij

v_ij ← v_ij (U^T X)_ij / (U^T U V)_ij

Or by projected gradient formulations.
The solution (U, V) is not unique: if (U, V) is a solution, then (UD, D^{-1}V) for any positive diagonal D is also a solution.

credit P.Gallinari UPMC
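A hedged numpy sketch of the iterative multiplicative algorithm above (the small eps added to the denominators, to avoid division by zero, is an addition of the sketch, not part of the slides):

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9):
    """Factorize a non-negative X (n x m) as U (n x k) @ V (k x m) with multiplicative updates."""
    n, m = X.shape
    rng = np.random.RandomState(0)
    U = rng.rand(n, k)                              # random non-negative initialization
    V = rng.rand(k, m)
    for _ in range(n_iter):
        U *= (X @ V.T) / (U @ V @ V.T + eps)        # u_ij <- u_ij (XV^T)_ij / (UVV^T)_ij
        V *= (U.T @ X) / (U.T @ U @ V + eps)        # v_ij <- v_ij (U^T X)_ij / (U^T U V)_ij
    return U, V
```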


Page 44: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data compression - Matrix Factorization

Using NMF for Clustering
Normalize U so that each column vector has norm 1:

u_ij ← u_ij / √(∑_i u_ij²)

v_ij ← v_ij · √(∑_i u_ij²)

Under the constraint "U normalized", the solution (U, V) is unique.
Associate x^i to cluster j if j = argmax_j(v_ij)

credit P.Gallinari UPMC
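Continuing the previous sketch, the normalization and cluster assignment could look as follows (this reuses the hypothetical nmf function from the sketch above and assumes a non-negative data matrix with examples in columns):

```python
import numpy as np

X = np.abs(np.random.RandomState(1).randn(30, 8))   # toy non-negative data: 8 examples of dimension 30
U, V = nmf(X, k=3)                                   # hypothetical function from the sketch above

norms = np.sqrt((U ** 2).sum(axis=0))    # norm of each column of U
U_n = U / norms                          # each column of U now has norm 1
V_n = V * norms[:, None]                 # rescale V so that U_n @ V_n equals U @ V
clusters = V_n.argmax(axis=0)            # cluster of example i: argmax over the k components
print(clusters)                          # one cluster index per example (column of X)
```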


Page 45: ICS Summer School 2016 Data Science - Week 4 Gabriella

Data compression - Matrix Factorization

Many different versions and extensions of NMF:
Different loss functions (e.g. different constraints)
Different algorithms
Applications:
Clustering
Link prediction
Recommendation
etc.

credit P.Gallinari UPMC

