Introduction to Machine Learning @ Mooncascade ML Camp


by Ilya Kuzovkin, ilya.kuzovkin@gmail.com

Mooncascade ML Camp 2016

MACHINE LEARNING: ESSENTIAL CONCEPTS

ONE MACHINE LEARNING USE CASE

Can we ask a computer to create those patterns automatically?

Yes

How?

Raw data

A data sample: an instance consists of the raw data and its class (label), e.g. an image of a handwritten digit labeled “7”.

How to represent it in a machine-readable form?

Feature extraction

The image is 28 px × 28 px, i.e. 784 pixels in total.

Feature vector: (0, 0, 0, …, 28, 65, 128, 255, 101, 38, … 0, 0, 0)

Many labeled feature vectors together form a dataset:

(0, 0, 0, …, 28, 65, 128, 255, 101, 38, … 0, 0, 0)    “7”
(0, 0, 0, …, 13, 48, 102, 0, 46, 255, … 0, 0, 0)      “2”
(0, 0, 0, …, 17, 34, 12, 43, 122, 70, … 0, 7, 0)      “8”
(0, 0, 0, …, 98, 21, 255, 255, 231, 140, … 0, 0, 0)   “2”
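For concreteness, a minimal sketch of this step in Python with scikit-learn. Fetching MNIST from OpenML under the name "mnist_784" is an assumption of this sketch, not something the deck prescribes:

from sklearn.datasets import fetch_openml

# Each 28x28 digit image arrives already flattened into a 784-long feature vector.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
print(X.shape)  # (70000, 784): one row per instance, one column per pixel
print(y[0])     # the class (label) of the first instance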

The data is in the right format — what’s next?

• C4.5 • Random forests • Bayesian networks • Hidden Markov models • Artificial neural network • Data clustering • Expectation-maximization algorithm • Self-organizing map • Radial basis function network • Vector quantization • Generative topographic map • Information bottleneck method • IBSEAD • Apriori algorithm • Eclat algorithm • FP-growth algorithm • Single-linkage clustering • Conceptual clustering • K-means algorithm • Fuzzy clustering • Temporal difference learning • Q-learning • Learning automata

• AODE • Artificial neural network • Backpropagation • Naive Bayes classifier • Bayesian network • Bayesian knowledge base • Case-based reasoning • Decision trees • Inductive logic programming • Gaussian process regression • Gene expression programming • Group method of data handling (GMDH) • Learning automata • Learning vector quantization • Logistic model tree • Decision tree • Decision graphs • Lazy learning • Monte Carlo method • SARSA

• Instance-based learning • Nearest neighbor algorithm • Analogical modeling • Probably approximately correct learning (PACL) • Symbolic machine learning algorithms • Subsymbolic machine learning algorithms • Support vector machines • Random forest • Ensembles of classifiers • Bootstrap aggregating (bagging) • Boosting (meta-algorithm) • Ordinal classification • Regression analysis • Information fuzzy networks (IFN) • Linear classifiers • Fisher’s linear discriminant • Logistic regression • Naive Bayes classifier • Perceptron • Support vector machines • Quadratic classifiers • k-nearest neighbor • Boosting

Pick an algorithm

DECISION TREE

Two classes of feature vectors to tell apart:

(0, …, 28, 65, …, 207, 101, 0, 0)         (0, …, 28, 64, …, 102, 101, 0, 0)
(0, …, 19, 34, …, 254, 54, 0, 0)    vs.   (0, …, 19, 23, …, 105, 54, 0, 0)
(0, …, 87, 59, …, 240, 52, 4, 0)          (0, …, 87, 74, …, 121, 51, 7, 0)
(0, …, 87, 52, …, 240, 19, 3, 0)          (0, …, 87, 112, …, 239, 52, 4, 0)

Pick the feature that separates the classes best, e.g. PIXEL #417: samples with a value >200 go down one branch, samples with a value <200 down the other. Each branch is then split again on another feature, e.g. PIXEL #123 with the threshold <100 vs. >100, and so on, until the training samples are separated.
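In code this is a one-liner with scikit-learn, assuming the X and y arrays from the sketch above:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(X, y)              # finds splits like "pixel #417 > 200" automatically
print(tree.predict(X[:1]))  # predicted class for the first instance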

ACCURACY

Confusion matrix: rows are the true class, columns are the predicted class.

acc = correctly classified / total number of samples

Beware of an imbalanced dataset! Consider the following model: “Always predict 2”. On a dataset dominated by 2s (say, 90% of the samples), this useless model still reaches accuracy 0.9.
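Both numbers come straight out of scikit-learn; the label arrays below are hypothetical, standing in for a classifier’s output:

from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels.
y_true = ["7", "2", "8", "2", "2"]
y_pred = ["7", "2", "2", "2", "2"]

print(accuracy_score(y_true, y_pred))    # correctly classified / total = 0.8
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class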

DECISION TREE

“You said 100% accurate?! Every 10th digit your system detects is wrong!”

Angry client

We trained our system on the data the client gave us, but the system had never seen the new data the client applied it to. And in real life, it never will…

OVERFITTING

Simulate the real-life situation — split the dataset.

Underfitting! (“too stupid”)   …   OK   …   Overfitting! (“too smart”)

Our current decision tree has too much capacity: it has simply memorized all of the data. Let’s make it less complex.
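In scikit-learn, making the tree less complex means constraining its capacity; the particular limits below are illustrative, not values from the deck:

from sklearn.tree import DecisionTreeClassifier

# A shallower tree with larger leaves cannot memorize every training sample.
simpler_tree = DecisionTreeClassifier(max_depth=10, min_samples_leaf=5)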

You probably did not notice, but we are overfitting again :(

THE WHOLE DATASET

• TRAINING SET 60%: fit various models and parameter combinations on this subset.
• VALIDATION SET 20%: evaluate the models created with different parameters; estimate overfitting.
• TEST SET 20%: use only once, to get the final performance estimate.
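A sketch of the 60/20/20 split with scikit-learn’s train_test_split, which cuts a dataset in two, so we call it twice:

from sklearn.model_selection import train_test_split

# 60% training, then the remaining 40% halved into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)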

CROSS-VALIDATION

What if we got a too-optimistic validation set?

Merge training and validation back into a TRAINING SET of 80%. Fix the parameter value you need to evaluate, say msl=15. Split the 80% into training and validation parts, train, and score on the held-out part; repeat 10 times, each time holding out a different slice for validation. Take the average validation score over the 10 runs — it is a more stable estimate.
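The same procedure in scikit-learn; reading the slide’s msl as the tree’s min_samples_leaf parameter is an assumption:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(min_samples_leaf=15)       # the fixed parameter value
scores = cross_val_score(tree, X_train, y_train, cv=10)  # 10 different train/val splits
print(scores.mean())                                     # the more stable estimate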

MACHINE LEARNING PIPELINE

1. Take raw data.
2. Extract features.
3. Split into TRAINING and TEST.
4. Pick an algorithm and parameters.
5. Train on the TRAINING data.
6. Evaluate on the TRAINING data with CV.
7. Try out different algorithms and parameters, repeating steps 4–6.
8. Fix the best parameters.
9. Train on the whole TRAINING set.
10. Evaluate on TEST.
11. Report the final performance to the client.
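Putting the pipeline together as one hypothetical, self-contained run (same assumed X and y as above; the parameter grid is illustrative):

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Split, tune with CV on TRAINING only, refit the best model, evaluate once on TEST.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidate_msl = [1, 5, 15, 50]  # illustrative grid of parameter values
cv_scores = [cross_val_score(DecisionTreeClassifier(min_samples_leaf=m),
                             X_train, y_train, cv=10).mean()
             for m in candidate_msl]
best_msl = candidate_msl[int(np.argmax(cv_scores))]

final_model = DecisionTreeClassifier(min_samples_leaf=best_msl)
final_model.fit(X_train, y_train)         # train on the whole TRAINING set
print(final_model.score(X_test, y_test))  # touch TEST only once, at the very end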

“So it is ~87%… erm… Could you do better?”

Yes

Pick another algorithm from the same long list as before.

RANDOM FOREST

Decision tree: pick the best split out of all features.

Random forest: each tree picks the best split out of a random subset of features (one tree out of one random subset, another tree out of another random subset, yet another out of yet another, and so on).

To classify an instance, every tree in the forest predicts a class, and the forest outputs the class that most trees voted for.
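In code this is the classifier linked at the end of the deck; the number of trees below is an illustrative choice:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)  # 100 voting trees
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))                # accuracy on the held-out TEST set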

Happy client

ALL OTHER USE CASES

The same raw data → features → class recipe applies everywhere:

• Sound: frequency components → genre
• Text: bag of words → topic
• Image: pixel values → cat or dog
• Video: frame pixels → walking or running
• Database records: biometric data, census data, average salary, … → dead or alive

HANDS-ON SESSION

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
