Posted on 07-Feb-2017

Morning class summary

Mercè Martín

BigML

Day 1

State of the Art in ML

• History

• Machine Learning problems and tasks:

➔ Supervised Learning: Classification, Regression, Multi-label classification

➔ Unsupervised Learning: Clusters, Anomaly Detectors

➔ Semi-supervised Learning: Inference from partially labeled data

• Features: numeric, categorical, date-time, text

text analysis: frequency-weighted bag of words
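A frequency-weighted bag of words can be sketched in a few lines (a toy version assuming whitespace tokenization; real pipelines also handle punctuation, casing rules and stop words):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase, split on whitespace, and count term frequencies.
    return Counter(text.lower().split())

vector = bag_of_words("the cat sat on the mat")
```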

Poul Petersen (BigML)

Explicit rules: difficult to find and re-train

Implicit rules (data rules): easy to re-train

• Technology

• Teaching computers to learn:

too general vs. too specific (under-fitting vs. over-fitting)

Missing values handling: new category, averages, multiple choices
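The missing-value strategies named above (averages for numeric fields, a new category for categorical ones) can be sketched as follows (toy code; `impute` and the `MISSING` marker are illustrative names, not any particular library's API):

```python
def impute(values, kind):
    """Fill None entries: mean for numeric fields, a new category otherwise."""
    if kind == "numeric":
        known = [v for v in values if v is not None]
        mean = sum(known) / len(known)
        return [mean if v is None else v for v in values]
    # Categorical: treat "missing" as its own class.
    return ["MISSING" if v is None else v for v in values]

nums = impute([1.0, None, 3.0], "numeric")
cats = impute(["a", None, "b"], "categorical")
```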


Storage: low prices, big data

APIs: combination and accessibility

Cloud: computational power

Predictive APIs

• Supervised learning:

Classification (output in a set of classes)

Regression (output is a number)

• Unsupervised learning: no output info

• Training / Test separation: partitioning data, bootstrap or cross-validation
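The partitioning and bootstrap ideas can be sketched with the standard library (toy code; function names and the 80/20 split are illustrative choices):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    # Shuffle a copy, then partition into train and test sets.
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def bootstrap_sample(data, seed=42):
    # Sample with replacement: some rows repeat, others are left out.
    rng = random.Random(seed)
    return [rng.choice(data) for _ in data]

train, test = train_test_split(list(range(10)))
boot = bootstrap_sample(list(range(10)))
```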

• Classification: Confusion Matrix

Evaluating ML Algorithms

Cèsar Ferri (UPV)

• Classification metrics: Accuracy, Precision, Recall, F-measure

Extending to multi-class problems (averaging)
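The confusion-matrix metrics can be sketched for the binary case (toy code; extending to multi-class means averaging these per class, as the slide notes):

```python
def binary_metrics(tp, fp, fn, tn):
    # Derive the standard classification metrics from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```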

• Regression metrics: Mean Absolute error

Mean Squared error (more sensitive to extreme errors)

Root Mean Squared Error

Normalized for classifier comparison:

Relative Mean Squared Error

Relative Mean Absolute error

R2
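A sketch of these regression metrics (toy stdlib-only code; R2 here is the usual coefficient of determination, one minus the ratio of residual to total sum of squares):

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n  # more sensitive to extreme errors
    rmse = math.sqrt(mse)
    mean = sum(y_true) / n
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - sum(e * e for e in errors) / ss_tot
    return mae, mse, rmse, r2

mae, mse, rmse, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0])
```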

• Unsupervised evaluation: no estimations, association rules,

support

• Clustering: distance and shape based evaluation (border, centers, distribution)


• History

• Classification and Regression Trees

Structure where data is repeatedly separated into groups according to attribute values, to minimize error / maximize information gain (split criterion: Gini impurity)
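The Gini impurity split criterion can be sketched as follows (toy code; a tree learner would evaluate `split_gini` for every candidate split and pick the minimum):

```python
def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(left, right):
    # Weighted impurity of a candidate split over its two branches.
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

pure = gini(["a", "a", "a"])      # single class: impurity 0
mixed = gini(["a", "b"])          # 50/50 split: maximum binary impurity
score = split_gini(["a", "a"], ["b", "b"])  # perfect split: 0
```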

Decision Trees

Gonzalo Martinez (UAM)

Expert-Based Systems: human experts' rules (MYCIN: 600 rules; XCON: 2500 rules)

Automated Knowledge Acquisition: mining archives of cases (scalable); rules: CHAID, CART, ID3, C4.5


PROs

● Convertible to rules

● Categorical and numeric attributes

● Handle uninformative or redundant attributes

● Handle missing values

● Non-parametric (no predefined idea of concept to learn)

● Easy to tune (small number of parameters)

CONs

● Complex feature interactions

● Replication problem

Decision Trees

Predicates: rules are based on the split predicates; missing values; oblique splits (compare features)

Stopping criteria:

All instances in one class

No split found

Small number of instances

Gain below threshold

Maximum depth

Pruning: to avoid over-fitting. CART is slower (more trees needed, avoids complexity); C4.5 is faster but has no confidence threshold (avoids small nodes)

Parameters: number of nodes, depth, pruning (on/off and confidence), minimum number of instances to split

Ensembles of Decision Trees

Gonzalo Martinez (UAM)

• Ensembles of models

Randomizing to decrease errors and over-fitting: data, features or algorithms


Combined with voting or non-voting strategies (aggregators)

Best overall performance (SVMs)

Almost parameter-less

On trees, very fast to train and test

Slower than a single classifier (mitigated with pruning)

Ensembles of Decision Trees

• Robust

• Improves error

• Parallelizable

BAGGING

[Diagram: original dataset → bootstrap samples 1…T; each sample repeats some examples and removes others]
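The bagging prediction step can be sketched as a majority vote over the ensemble (toy code; the threshold "models" below stand in for trees trained on different bootstrap samples):

```python
from collections import Counter

def majority_vote(predictions):
    # Aggregate the ensemble members' votes for one instance.
    return Counter(predictions).most_common(1)[0][0]

def bag_predict(models, x):
    # Each model votes; the ensemble returns the majority class.
    return majority_vote([m(x) for m in models])

# Three toy "models": thresholds standing in for trees trained on
# different bootstrap samples of the data.
models = [lambda x: "pos" if x > 1 else "neg",
          lambda x: "pos" if x > 2 else "neg",
          lambda x: "pos" if x > 5 else "neg"]
label = bag_predict(models, 3)
```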

Ensembles of Decision Trees

BOOSTING

[Diagram: original dataset, reweighted at each iteration 1, 2, …]

Good average generalization error

Not robust (noise)

Can increment error of the base classifier

Not parallelizable

Ensembles of Decision Trees

• Robust

• Improves error

• Parallelizable

• Better than boosting

• Very fast to train

RANDOM FORESTS

[Diagram: original dataset → bootstrap samples 1…T (repeated and removed examples), each tree also using a random feature subset]

Ensembles of Decision Trees

CLASS SWITCHING

[Diagram: original dataset with random label noise applied T times, switching probability p = 30%]

Can improve results in cases where normal decision trees are not especially good
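Class switching can be sketched as label-flipping noise (toy code; `p` is the switching probability from the slide, and each ensemble member would train on a differently perturbed copy):

```python
import random

def switch_labels(labels, classes, p=0.3, seed=1):
    # Flip each label to a *different* class with probability p.
    rng = random.Random(seed)
    out = []
    for y in labels:
        if rng.random() < p:
            out.append(rng.choice([c for c in classes if c != y]))
        else:
            out.append(y)
    return out

noisy = switch_labels(["a"] * 100, classes=["a", "b"], p=0.3)
```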

• Human knowledge used to compensate for data problems: broken data (remove corner cases, defaults), missing values (which have meaning), reduced complexity (grouping classes), distances

• Discretization: significant bins instead of concrete values

• Delta: the difference or distance between features can be significant

• Standardization: mean of zero and standard deviation of one

• Normalizing: Feature vectors with unit norm

• Windowing: Previous points distributed in time
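Standardization and normalization as defined above can be sketched as (toy stdlib-only code; standardization uses the population standard deviation here):

```python
import math

def standardize(values):
    # Rescale to mean 0 and standard deviation 1.
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def normalize(vector):
    # Rescale a feature vector to unit (L2) norm.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

z = standardize([2.0, 4.0, 6.0])
u = normalize([3.0, 4.0])
```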

Data Transformations and FE

Charles Parker (BigML)

• Projections: Combining to have a new feature basis (lowering dimensionality)

New axis: Principal component analysis

Keep neighbours: Spectral embeddings, combination methods (Large Margin Nearest Neighbor, Xing's Method)

• Sparsity: compressing sparse text and image data by sampling and grouping
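As an illustration of exploiting sparsity (a common complementary trick, not the sampling/grouping method the slide refers to): storing only nonzero entries keeps sparse vectors small and operations on them cheap:

```python
def to_sparse(vector):
    # Store only the nonzero entries as {index: value}.
    return {i: v for i, v in enumerate(vector) if v != 0}

def dot_sparse(a, b):
    # Dot product touching only indices present in both sparse vectors.
    return sum(v * b[i] for i, v in a.items() if i in b)

a = to_sparse([0, 3, 0, 0, 2])
b = to_sparse([1, 1, 0, 0, 4])
d = dot_sparse(a, b)  # 3*1 + 2*4
```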


• Sub-sampling and over-sampling: restore balance by eliminating instances of over-represented categories or giving higher weight to under-represented categories

• Evaluating unbalanced datasets: good accuracy is not enough; look at precision and recall

Precision vs. recall trade-off: you must define the cost of each (letting out positives against letting in negatives)

Unbalanced Datasets

Poul Petersen (BigML)

[Bar chart: instance counts per class, "Fraud" vs. "Not Fraud" (y-axis 0 to 3750), showing heavy class imbalance]

• Automatic balancing: equal representation per class

• Weighting: which instances are more important; adds new information to the dataset. Per class or per instance.
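Per-class weighting can be sketched with weights inversely proportional to class frequency (one common convention, assumed here; under it each class contributes equally to training overall):

```python
from collections import Counter

def balanced_weights(labels):
    # Weight each class by n / (k * count): rare classes get large weights,
    # frequent classes get small ones, and the classes balance out.
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}

weights = balanced_weights(["fraud"] * 10 + ["ok"] * 90)
```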
