
Page 1

Machine Learning Basics¹

STA442/2101 Fall 2018

¹See last slide for copyright information.

1 / 32

Page 2

Source

Chapter 5 in Deep Learning by Goodfellow, Bengio and Courville

I have copy-pasted so many quotes from this text that this slideshow fits the definition of plagiarism.

2 / 32

Page 3

Machine learning is a form of applied statistics

Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions.

3 / 32

Page 4

Learning

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

The term learning can also mean model fitting or parameter estimation.

4 / 32

Page 5

Tasks

Machine learning tasks are usually described in terms of how the machine learning system should process an example. An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process.

Example = data; usually what we would call a case

Feature = variable

5 / 32

Page 6

Examples of common tasks T

Classification: In this type of task, the computer program is asked to specify which of k categories some input belongs to. Can be with or without missing inputs (medical diagnosis).

Regression: In this type of task, the computer program is asked to predict a numerical value given some input.

Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form. For example, in optical character recognition, . . .

Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language.

6 / 32

Page 7

More Examples of tasks

Anomaly detection: In this type of task, the computer program sifts through a set of events or objects, and flags some of them as being unusual or atypical. An example of an anomaly detection task is credit card fraud detection. To me, detection of credit card fraud is a classification problem. Another example of anomaly detection is outlier detection.

Imputation of missing values.

Denoising: The machine learning algorithm is given as input a corrupted example obtained by an unknown corruption process from a clean example. The learner must predict the clean example from its corrupted version. Sounds like measurement error modeling.

Density estimation or probability mass function estimation.

7 / 32

Page 8

Performance measure P

Performance measures are usually specific to the task. Accuracy is an example.

“We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not.”

Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world. We therefore evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.
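A minimal sketch (mine, not the text's) of estimating the expected 0-1 loss by the error rate on a held-out test set:

```python
import numpy as np

# Hypothetical predicted and true labels for a held-out test set.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])

# 0-1 loss per example: 0 if correctly classified, 1 if not.
zero_one_loss = (y_pred != y_true).astype(int)

# The test error rate estimates the expected 0-1 loss.
error_rate = zero_one_loss.mean()
print(f"test error rate = {error_rate:.3f}, accuracy = {1 - error_rate:.3f}")
```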

8 / 32

Page 9

Experience E

Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process. Most of the learning algorithms in this book can be understood as being allowed to experience an entire dataset.

Experience = to process data?

9 / 32

Page 10

Supervised versus unsupervised learning

Roughly speaking, unsupervised learning involves observing several examples of a random vector x, and attempting to implicitly or explicitly learn the probability distribution p(x), or some interesting properties of that distribution, while supervised learning involves observing several examples of a random vector x and an associated value or vector y, and learning to predict y from x, usually by estimating p(y|x).

Supervised learning: There is a response variable. (Called a “label” or “target.”)

Unsupervised learning: No response variable.

Cluster analysis
Principal components
Density estimation

10 / 32

Page 11

Design matrix

A design matrix is a matrix containing a different example in each row. Each column of the matrix corresponds to a different feature. For instance, the Iris dataset contains 150 examples with four features for each example.

Example = case (there are n cases).

Feature = variable.
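For concreteness, a quick look at the design-matrix convention with the Iris data mentioned above (assuming scikit-learn is available):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data           # design matrix: one example (case) per row
print(X.shape)          # (150, 4): 150 examples, 4 features (variables)
print(iris.feature_names)
```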

11 / 32

Page 12

Regression example
“A simple machine learning algorithm: linear regression.”

“The goal is to build a system that can take a vector x ∈ R^n as input and predict the value of a scalar y ∈ R as its output.”

“We define the output to be y = wᵀx, where w ∈ R^n is a vector of parameters.” ouch.

Test set will be used only for evaluation. Good.

“One way of measuring the performance of the model is to compute the mean squared error of the model on the test set.” MSE_test

“Minimize the mean squared error on the training set, MSE_train.”

Then they present the normal equations and say that evaluating

w = (X^(train)ᵀ X^(train))⁻¹ X^(train)ᵀ y^(train)

“constitutes a simple learning algorithm.” So “learning” is definitely estimation, or at least curve fitting.
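Below is a minimal numpy sketch of this “simple learning algorithm” (the simulated data and dimensions are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training and test sets from the same data-generating process.
n, p = 100, 3
w_true = np.array([1.0, -2.0, 0.5])
X_train = rng.normal(size=(n, p))
y_train = X_train @ w_true + rng.normal(scale=0.1, size=n)
X_test = rng.normal(size=(n, p))
y_test = X_test @ w_true + rng.normal(scale=0.1, size=n)

# "Learning": solve the normal equations  X'X w = X'y  on the training set.
w_hat = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

mse_train = np.mean((X_train @ w_hat - y_train) ** 2)
mse_test = np.mean((X_test @ w_hat - y_test) ** 2)
print(w_hat, mse_train, mse_test)
```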

12 / 32

Page 13

What is a “model?”

I know what a model is in statistics. It’s a set of assertions that implies a probability distribution for the observable data.

In machine learning, the meaning is slippery – close but not quite the same.

The “system that can take a vector x ∈ R^n as input and predict the value of a scalar y ∈ R as its output” would probably be called a model.

In statistics, this would be a combination of a model and an estimator.

13 / 32

Page 14

Generalization

Generalization: The ability to perform well on previously unobserved inputs.

With access to a training set, we can compute some error measure on the training set called the training error, and we reduce this training error. This is just optimization.

The generalization error is defined as the expected value of the error on a new input. Good, that’s clear.

We want the generalization error, also called the test error, to be low as well. That is, we want generalization error as well as training error to be low.

There’s a theorem that generalization error is greater than or equal to (expected) training error.

14 / 32

Page 15

Objectives

Make the training error small.

Make the gap between training and test error small.

15 / 32

Page 16

Data generating distribution
The i.i.d. assumptions

To get anywhere, even the machine learning people have to make some assumptions.

Assume training set and test set are independent.

Examples are independent within sets.

Both training set and test set come from a common data generating distribution, denoted p_data.

16 / 32

Page 17

Over- and underfitting

Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.

Overfitting occurs when the gap between the training error and test error is too large.

Notice how empirical this is compared to the statistical formulation.

17 / 32

Page 18

Model capacity

A model’s capacity is its ability to fit a wide variety of functions.

“Models with low capacity may struggle to fit the training set.”

“Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.”

Hypothesis space: The set of functions that the learning algorithm is allowed to select as being the solution.

Polynomial regression example (sketch below).

“The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective. This is called the representational capacity of the model.”

So the representational capacity appears to be determined by the hypothesis space.
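A sketch of the polynomial regression example (the toy data are my own): raising the degree enlarges the hypothesis space, so training error falls while test error can eventually rise.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-1, 1, 20))
y_train = np.sin(3 * x_train) + rng.normal(scale=0.2, size=20)
x_test = np.sort(rng.uniform(-1, 1, 200))
y_test = np.sin(3 * x_test) + rng.normal(scale=0.2, size=200)

for degree in (1, 3, 9):
    # Fit within the hypothesis space of degree-d polynomials.
    coeffs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```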

18 / 32

Page 19

There are theoretical results

The most important results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases.

The authors note that these bounds are very loose and seldom used in practice.

19 / 32

Page 20

k nearest neighbor regression

Capacity of the model grows with the size of the data set.

“This algorithm is able to achieve the minimum possible training error on any regression dataset.”
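A sketch of nearest neighbor regression (my own minimal version, k = 1) illustrating the quoted claim: each training point is its own nearest neighbor, so training error is zero when the inputs are distinct.

```python
import numpy as np

def knn_predict(x_query, x_train, y_train, k=1):
    """Predict by averaging the y-values of the k nearest training points."""
    dist = np.abs(x_train[None, :] - x_query[:, None])  # pairwise distances
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(2)
x_train = rng.uniform(0, 1, 30)
y_train = np.cos(5 * x_train) + rng.normal(scale=0.3, size=30)

# With k = 1, every training point predicts itself:
fitted = knn_predict(x_train, x_train, y_train, k=1)
print(np.mean((fitted - y_train) ** 2))  # 0.0, assuming distinct x values
```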

20 / 32

Page 21

Bayes error

The ideal model is an oracle that simply knows the true probability distribution that generates the data.

The error incurred by an oracle making predictions from the true distribution p(x, y) is called the Bayes error.

I have no idea why.

21 / 32

Page 22

The No Free Lunch Theorem

The no free lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.

So you need to make some assumptions about the probability distributions that might reasonably be encountered in practice.

22 / 32

Page 23

Regularization

The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task.

Choose functions in the hypothesis space well.

We can also give a learning algorithm a preference for onesolution in its hypothesis space over another.

This is called regularization.

Weight decay example in regression: Minimize

J(w) = MSE_train + λ wᵀw

I don’t know about the crazy vocabulary, but there is some freedom here that is harder to find in standard statistical practice.
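A sketch of the weight-decay minimizer in closed form (assuming, as in the book's regression example, MSE_train = (1/m)‖Xw − y‖²; setting the gradient of J to zero then gives the system in the comment):

```python
import numpy as np

def weight_decay_fit(X, y, lam):
    """Minimize J(w) = MSE_train + lam * w'w  (i.e., ridge regression)."""
    m, p = X.shape
    # Zero gradient: (X'X + m*lam*I) w = X'y.
    return np.linalg.solve(X.T @ X + m * lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0, 0, -1.0, 0]) + rng.normal(size=50)

for lam in (0.0, 0.1, 10.0):
    print(lam, np.round(weight_decay_fit(X, y, lam), 2))
    # Larger lam expresses a stronger preference for small w: the
    # estimates shrink toward zero.
```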

23 / 32

Page 24

Hyperparameters

Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm.

These settings are called hyperparameters.

The values of hyperparameters are not adapted by the learning algorithm itself.

This is very different from the meaning of hyperparameter as I understand it.

24 / 32

Page 25

Validation set
A useful concept

Want to search around in the hypothesis space to locate the best model, but that could result in over-fitting.

Can’t peek at the test data.

Split the training data, yes the training data, into two disjoint subsets.

One of these subsets is used to learn the parameters.

The other subset is our validation set, used to estimate the generalization error during or after training, allowing for the hyperparameters to be updated accordingly.

The subset of data used to learn the parameters is still typically called the training set, even though this may be confused with the larger pool of data used for the entire training process.
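A sketch of the split (the 80/20 proportion is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
X_pool = rng.normal(size=(100, 4))   # the full training pool
y_pool = rng.normal(size=100)

# Shuffle, then split roughly 80/20 into training and validation sets.
perm = rng.permutation(100)
train_idx, val_idx = perm[:80], perm[80:]
X_train, y_train = X_pool[train_idx], y_pool[train_idx]
X_val, y_val = X_pool[val_idx], y_pool[val_idx]

# Parameters are learned on (X_train, y_train); (X_val, y_val) is used
# only to estimate generalization error and to tune hyperparameters.
print(X_train.shape, X_val.shape)  # (80, 4) (20, 4)
```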

25 / 32

Page 26

k-fold cross validation

You don’t have enough data to split into a training set and a test set.

Split into k disjoint subsets. Try to predict each one in turn using the rest of the data as a training sample.

Average the results.

The variance of such an average is hard to estimate well.
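A minimal k-fold sketch (mine; ordinary least squares stands in for whatever learner is being evaluated):

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Average held-out MSE over k folds, each fold held out once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        # Fit on the other k-1 folds, then predict the held-out fold.
        w = np.linalg.solve(X[train].T @ X[train], X[train].T @ y[train])
        scores.append(np.mean((X[held_out] @ w - y[held_out]) ** 2))
    return np.mean(scores)

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(scale=0.5, size=60)
print(k_fold_mse(X, y, k=5))
```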

26 / 32

Page 27

Standard statistical ideas
Sometimes with a strange twist

Sometimes there really are unknown parameters and they do maximum likelihood.

They still refer to estimation as prediction, most of the time.

“In machine learning experiments, it is common to say that algorithm A is better than algorithm B if the upper bound of the 95% confidence interval for the error of algorithm A is less than the lower bound of the 95% confidence interval for the error of algorithm B.”

Why not just test?
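A sketch of the quoted convention (the normal-approximation intervals for a binomial error count are my assumption about how the intervals would be computed):

```python
import numpy as np

def error_ci(errors, z=1.96):
    """95% normal-approximation CI for an error rate from 0-1 losses."""
    p_hat = np.mean(errors)
    se = np.sqrt(p_hat * (1 - p_hat) / len(errors))
    return p_hat - z * se, p_hat + z * se

rng = np.random.default_rng(6)
errors_a = rng.binomial(1, 0.10, size=1000)  # algorithm A: ~10% error
errors_b = rng.binomial(1, 0.20, size=1000)  # algorithm B: ~20% error

lo_a, hi_a = error_ci(errors_a)
lo_b, hi_b = error_ci(errors_b)
# "A is better than B" by the quoted rule if A's upper bound is below
# B's lower bound.
print(hi_a < lo_b)
```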

27 / 32

Page 28

Bayesian Statistics

Pretty standard treatment.

Predictive density: Integrate out the parameter using the posterior distribution.

Regression example with a conjugate prior: “The Bayesian estimate provides a covariance matrix, showing how likely all the different values of w are, rather than providing only the estimate . . . ”

“Maximum A Posteriori (MAP) Estimation” means estimate using the posterior mode.
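As a worked example of MAP estimation (the Gaussian assumptions here are mine, not the slide's): take likelihood y|w ~ N(Xw, σ²I) and prior w ~ N(0, τ²I). Then the posterior mode solves a weight-decay problem, tying this back to the regularization on slide 23:

```latex
\hat{w}_{\mathrm{MAP}}
  = \arg\max_{w}\,\bigl[\log p(y \mid w) + \log p(w)\bigr]
  = \arg\min_{w}\,\Bigl[\|y - Xw\|^2 + \tfrac{\sigma^2}{\tau^2}\,\|w\|^2\Bigr].
```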

28 / 32

Page 29

Support vector machines

Method for binary classification.

It looks pretty strong. This is new to me, I think.

Predict Yes for test data x when wᵀx + b is positive.

Predict No when wᵀx + b is negative.

Replace wᵀx with b + ∑_{i=1}^m αᵢ φ(x) · φ(x^(i)),

where x^(i) is a vector of training data and the dot product is very general.
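A sketch of kernelized prediction (the RBF kernel and the fitted αᵢ and b are placeholders of mine; a real SVM solver would supply them):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """phi(x) . phi(z) computed implicitly: the 'kernel trick'."""
    return np.exp(-gamma * np.sum((x - z) ** 2, axis=-1))

def svm_decision(x, X_train, alpha, b, gamma=1.0):
    """f(x) = b + sum_i alpha_i k(x, x^(i)); predict Yes if positive."""
    k = rbf_kernel(X_train, x, gamma)  # kernel of x against each training point
    return b + np.dot(alpha, k)

rng = np.random.default_rng(7)
X_train = rng.normal(size=(10, 2))
alpha = rng.normal(size=10)   # placeholder: would come from SVM training
b = 0.1                       # placeholder intercept

x_new = np.array([0.5, -0.2])
print("Yes" if svm_decision(x_new, X_train, alpha, b) > 0 else "No")
```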

29 / 32

Page 30

Stochastic Gradient Descent

A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.

The cost function used by a machine learning algorithm often decomposes as a sum over training examples of some per-example loss function. The minus log likelihood is a sum over observations.

The sample size can be huge.

Calculating the big sum (of derivatives) can take a lot of computation.

So just compute the gradient on a random sample. Random = stochastic.

Go downhill, a little randomly.

It’s no big deal, but why not just do the whole task on a random sample of the data?
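A minibatch SGD sketch for the linear regression loss (step size and batch size are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(8)
m, p = 10_000, 4
X = rng.normal(size=(m, p))
y = X @ np.array([1.0, -1.0, 2.0, 0.5]) + rng.normal(scale=0.1, size=m)

w = np.zeros(p)
lr, batch = 0.05, 32
for step in range(2000):
    idx = rng.integers(0, m, size=batch)     # random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2 / batch * Xb.T @ (Xb @ w - yb)  # gradient of minibatch MSE
    w -= lr * grad                           # go downhill, a little randomly
print(np.round(w, 2))  # close to (1, -1, 2, 0.5)
```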

30 / 32

Page 31

Recipe for a machine learning algorithm

Combine

Dataset

Cost function

Optimization procedure

Model

31 / 32

Page 32

Copyright Information

This slide show was prepared by Jerry Brunner, Department of Statistical Sciences, University of Toronto. So many quotes are lifted from Deep Learning by Goodfellow et al. that this document fits the definition of plagiarism. LaTeX source code is available from the course website: http://www.utstat.toronto.edu/~brunner/oldclass/appliedf18

32 / 32