image classification and retrieval on spark

SPARK MBUTO Design & Engineering Machine Learning Pipelines Gianvito Siciliano Use Case: Image Classification and Retrieval

Upload: gianvito-siciliano

Post on 15-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Image Classification and Retrieval on Spark

SPARK MBUTO

Design & Engineering Machine Learning Pipelines

Gianvito Siciliano

Use Case: Image Classification and Retrieval

Page 2: Image Classification and Retrieval on Spark

OUTLINE

1. Spark ‘Mbuto intro

2. ML problems overview

3. Classification & retrieval logic

4. Classification Models

5. Image Pipeline

Page 3: Image Classification and Retrieval on Spark

OUTLINE

1. Spark ‘Mbuto intro

• Abstractions

• Basic Examples

2. ML problems overview

3. Classification & retrieval logic

4. Classification Models

5. Image Pipeline

Page 4: Image Classification and Retrieval on Spark

SPARK MBUTO

• A Spark proof of concept to easily create, run, and test pipelines and workflows

• Pipelines are made of sequential steps in a SparkJobApp

• Each step is a SparkJob

• Each job shares the same Spark/SQL context

• Jobs are run consecutively by the JobRunner
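The three abstractions above can be sketched in a few lines. This is a plain-Python illustration, not the project's actual code: the class names follow the slides, but the method signatures, the `LoadNumbers` and `SumNumbers` jobs, and the dictionary standing in for the shared Spark/SQL context are all assumptions made for the example.

```python
from abc import ABC, abstractmethod


class SparkJob(ABC):
    """One pipeline step. `context` stands in for the shared Spark/SQL context."""

    @abstractmethod
    def execute(self, context: dict) -> None:
        ...


class JobRunner:
    """Runs jobs one after another, handing each the same context object."""

    def __init__(self, context: dict):
        self.context = context

    def run(self, jobs: list) -> dict:
        for job in jobs:
            job.execute(self.context)
        return self.context


class LoadNumbers(SparkJob):  # hypothetical example job
    def execute(self, context):
        context["numbers"] = [1, 2, 3, 4]


class SumNumbers(SparkJob):  # hypothetical example job
    def execute(self, context):
        context["total"] = sum(context["numbers"])


def main() -> dict:
    """Plays the role of the SparkJobApp entry point."""
    return JobRunner(context={}).run([LoadNumbers(), SumNumbers()])


result = main()
```

Because every job reads from and writes to the same context, the second job can use what the first one produced without any explicit wiring between them.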

Page 5: Image Classification and Retrieval on Spark

SPARKJOB

Page 6: Image Classification and Retrieval on Spark

JOBRUNNER

Page 7: Image Classification and Retrieval on Spark

SPARKJOBAPP

Page 8: Image Classification and Retrieval on Spark

PIPELINE

App.main → JobRunner.run → Job.execute → next job → Job.execute

Page 9: Image Classification and Retrieval on Spark

JOB READY TO USE

Page 10: Image Classification and Retrieval on Spark

READABLE APP

App.main → JobRunner.run → Job.execute → next job → Job.execute

Page 11: Image Classification and Retrieval on Spark

PERFORMANCE LOOKUP

[Diagram: App, JobRunner, Job, Job (labels truncated in the transcript)]

Page 12: Image Classification and Retrieval on Spark

OUTLINE

1. Spark ‘Mbuto intro

2. ML problems overview

• Classification

• Retrieval

3. Classification & retrieval logic

4. Classification Models

5. Image Pipeline

Page 13: Image Classification and Retrieval on Spark

IMAGE CLASSIFICATION

• Multiclass image classification:

1. Choose model (NN, SVM, TREE…)

2. Train/test model (with labeled images)

3. Predict the label of new images

4. Tune the model

Page 14: Image Classification and Retrieval on Spark

IMAGE RETRIEVAL

• Image retrieval:

1. Choose metric (Euclidean, cosine…)

2. Build dictionary

3. Train/test the model

4. Query and search

5. Tune the model
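The "choose metric" and "query and search" steps can be illustrated with a brute-force search over a small index. This is a minimal sketch under stated assumptions: the `search` function, the image ids, and the three-dimensional vectors are invented for the example, and a real index would hold the image representations built later in the deck.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine similarity; ignores vector length, compares direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def search(index, query, k=2, metric=euclidean):
    """Return the ids of the k indexed vectors closest to the query."""
    ranked = sorted(index, key=lambda image_id: metric(index[image_id], query))
    return ranked[:k]


# Hypothetical toy index: image id -> feature vector.
index = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.0],
    "car_1": [0.0, 0.1, 0.9],
}
hits = search(index, [1.0, 0.0, 0.0], k=2, metric=cosine_distance)
```

Swapping `metric` is the only change needed to compare Euclidean and cosine rankings on the same index, which is one simple way to approach the "tune the model" step.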

Page 15: Image Classification and Retrieval on Spark

WHAT CHANGES?

• Pipelines architecture

• Classification logic

• How to update the model?

Page 16: Image Classification and Retrieval on Spark

CLASSIFICATION PIPELINE

DATA

TRAIN CLASSIFIER

MODEL

NEW DATA

PREDICTION

Page 17: Image Classification and Retrieval on Spark

RETRIEVAL PIPELINE

DATA

TRAIN CLASSIFIER

MODEL QUERY

PREDICTION

Page 18: Image Classification and Retrieval on Spark

OUTLINE

1. Spark ‘Mbuto intro

2. ML problems overview

3. Classification & retrieval logic

4. Classification Models

5. Image Pipeline

Page 19: Image Classification and Retrieval on Spark

CLASSIFICATION & RETRIEVAL

• Keypoint extraction from each image

• Clustering on the keypoints universe

• Represent each image with a weighted cluster vector

• Train & Test the model

• Query the model (finding the most similar images)

Features Engineering

Build the Dictionary

Build the classifier

Query the model
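The step from keypoints to a per-image cluster vector can be sketched as follows. The example assumes the clustering has already produced centroids; the two-dimensional descriptors and the function names are hypothetical (real SIFT descriptors have 128 dimensions), and the weighting of the counts is left out here.

```python
def nearest_cluster(descriptor, centroids):
    """Index of the centroid closest (squared Euclidean) to the descriptor."""
    def sq_dist(centroid):
        return sum((d - c) ** 2 for d, c in zip(descriptor, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def cluster_histogram(descriptors, centroids):
    """Count how many of an image's keypoint descriptors fall in each cluster."""
    counts = [0] * len(centroids)
    for descriptor in descriptors:
        counts[nearest_cluster(descriptor, centroids)] += 1
    return counts


# Hypothetical toy data: two cluster centroids and one image with three keypoints.
centroids = [[0.0, 0.0], [10.0, 10.0]]
image_descriptors = [[1.0, 0.5], [9.0, 11.0], [0.2, 0.1]]
histogram = cluster_histogram(image_descriptors, centroids)
```

Each image ends up as a fixed-length vector regardless of how many keypoints it had, which is what lets a standard classifier consume it.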

Page 20: Image Classification and Retrieval on Spark

C. & R. JOBS

• Load the whole dataset

• Extract keypoints

• Reduce the keypoints universe

• Transform the features space

• Create the dictionary (aka Codebook)

• Train, test & evaluate the classifier

• Query and get prediction

DATA

TRAIN CLASSIFIER

MODEL

PREDICTION

Page 21: Image Classification and Retrieval on Spark

KMeansCLASSIFIER

ImageLOADER

.transform

SiftEXTRACTOR

KMeansQUANTISER

.fit

CLUSTERS

CfIifTRANSFORMER

ClusterVectorPIVOTER

CODEBOOK

Features Engineering

Build the Dictionary

DICTIONARY

TRANSFORMER

ESTIMATOR
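The slides name a CfIifTransformer without defining it. Reading it by analogy with TF-IDF, as cluster frequency times inverse image frequency, is an assumption of this sketch, as are the function name and the exact formula (relative cluster frequency multiplied by the natural log of images over images containing the cluster).

```python
import math


def cf_iif(histograms):
    """Weight per-image cluster counts, TF-IDF style (assumed formula)."""
    n_images = len(histograms)
    n_clusters = len(histograms[0])
    # Number of images in which each cluster appears at least once.
    image_freq = [sum(1 for h in histograms if h[c] > 0) for c in range(n_clusters)]
    # Clusters present in every image get weight 0; unseen clusters also get 0.
    iif = [math.log(n_images / f) if f else 0.0 for f in image_freq]
    weighted = []
    for h in histograms:
        total = sum(h) or 1  # avoid dividing by zero for images without keypoints
        weighted.append([(count / total) * iif[c] for c, count in enumerate(h)])
    return weighted


# Hypothetical toy input: two images, three clusters.
histograms = [[2, 1, 0], [1, 0, 3]]
codebook = cf_iif(histograms)
```

Under this reading, a cluster that shows up in every image carries no discriminating information and is weighted down to zero, just as a word that appears in every document is under TF-IDF.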

Page 22: Image Classification and Retrieval on Spark

VectorASSEMBLER

.transform

LabelINDEXER

KNNCLASSIFIER

.fit

.transform

.fit

KMeansCLASSIFIER

TRAIN TEST

.split

EVALUATOR

Train classifier

Evaluate classifier

INSAMPLE PREDICTION

OUTSAMPLE PREDICTION

CLASSIFIER

TRANSFORMER

ESTIMATOR
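The label indexing, train/test split, and evaluation stages in this diagram can be sketched without Spark. The helper names and toy labels below are assumptions for illustration. Note that this sketch orders labels alphabetically for determinism, whereas Spark's StringIndexer orders them by frequency by default.

```python
import random


def index_labels(labels):
    """Map each distinct string label to an integer index (alphabetical order)."""
    mapping = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and cut off the first part as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical toy labels, one per image.
labels = ["cat", "dog", "cat", "car", "dog", "cat", "car", "dog"]
indexed, mapping = index_labels(labels)
train, test = train_test_split(list(enumerate(indexed)))
```

Scoring predictions on `train` gives the in-sample figure and scoring them on `test` gives the out-of-sample one, matching the two prediction boxes in the diagram.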

Page 23: Image Classification and Retrieval on Spark

KNN IMPLEMENTATION

• KNN is a comparison model: the similarity metric is crucial!

• Nearest-neighbour search (in the codebook) is the pain point:

• KDTree: not parallel (although…)

• LSH: hyperparameters are difficult to tune

• Metric tree: splits the feature points into disjoint regions

• Spill tree: too many shared points

=> Hybrid tree
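For reference, the baseline that all of these index structures try to speed up is a brute-force scan. The sketch below is a minimal KNN classifier with a majority vote. The function name, the Euclidean metric, and the toy data are assumptions for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (vector, label) pairs.
    Scans every training vector, keeps the k nearest, and returns the majority label."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set.
train = [
    ([0.0, 0.0], "cat"),
    ([0.0, 1.0], "cat"),
    ([1.0, 0.0], "cat"),
    ([5.0, 5.0], "car"),
    ([6.0, 5.0], "car"),
]
prediction = knn_predict(train, [0.2, 0.2], k=3)
```

The scan costs one distance computation per training vector for every query, which is why the search over the codebook becomes the bottleneck as the dataset grows.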

Page 24: Image Classification and Retrieval on Spark

HYBRID TREE

• The TopTree is a metric tree

• The SubLeaf trees are spill trees, trained in parallel

• Nodes can be:

• OVERLAP => defeatist search

• NON OVERLAP => backtracking
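The two node types can be illustrated with a small single-process tree. This is a sketch under stated assumptions, not the implementation behind the slides: the pivot heuristic, the overlap buffer `tau`, the fallback threshold `rho`, `leaf_size`, and all function names are chosen for the example, and the parallel training of sub-trees is not shown.

```python
import math


def signed_margin(p, a, b):
    """Signed distance from p to the plane midway between pivots a and b.
    Negative values lie on a's side, positive values on b's side."""
    return (math.dist(p, a) ** 2 - math.dist(p, b) ** 2) / (2 * math.dist(a, b))


def build(points, tau=0.5, rho=0.7, leaf_size=2):
    if len(points) <= leaf_size:
        return {"leaf": points}
    # Pivot heuristic: a is far from the first point, b is far from a.
    a = max(points, key=lambda p: math.dist(p, points[0]))
    b = max(points, key=lambda p: math.dist(p, a))
    if a == b:  # all points coincide, nothing to split
        return {"leaf": points}
    margins = [signed_margin(p, a, b) for p in points]
    # Spill-tree split: points within tau of the plane go to both children.
    left = [p for p, m in zip(points, margins) if m <= tau]
    right = [p for p, m in zip(points, margins) if m > -tau]
    overlap = True
    if max(len(left), len(right)) > rho * len(points):
        # Too many shared points: fall back to a disjoint metric-tree split.
        left = [p for p, m in zip(points, margins) if m <= 0]
        right = [p for p, m in zip(points, margins) if m > 0]
        overlap = False
    return {"a": a, "b": b, "overlap": overlap,
            "left": build(left, tau, rho, leaf_size),
            "right": build(right, tau, rho, leaf_size)}


def nearest(node, q, best=None):
    if "leaf" in node:
        for p in node["leaf"]:
            if best is None or math.dist(p, q) < math.dist(best, q):
                best = p
        return best
    m = signed_margin(q, node["a"], node["b"])
    near, far = (node["left"], node["right"]) if m <= 0 else (node["right"], node["left"])
    best = nearest(near, q, best)
    # OVERLAP node: defeatist search, never look at the other child.
    # NON OVERLAP node: backtrack when the far side could hold a closer point.
    if not node["overlap"] and abs(m) < math.dist(best, q):
        best = nearest(far, q, best)
    return best


# Hypothetical toy points in two dimensions.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6), (9, 0), (10, 1)]
hybrid_tree = build(points)
exact_tree = build(points, rho=0.0)  # rho=0 forces every node to be non-overlapping
```

Defeatist search trades accuracy for speed: it can miss the true nearest neighbour when the query falls close to a split, and the overlap buffer exists to make that less likely. Setting `rho=0.0` turns every node into a backtracking node, so the search becomes exact.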

Page 25: Image Classification and Retrieval on Spark

NEURAL NETWORK

• Convolutional networks work well with images

• Hyperparameter tuning is the pain point, but it can be automated (see the new algorithm)

• Training is not trivial, and updating the model is easy to complain about

Page 26: Image Classification and Retrieval on Spark

WHAT MORE?

• Feature engineering

• Hyperparameters tuning

• Parallel optimizations

• Persist/update steps

• Ensemble models

DATA

Combiner

PREDICTION

Normalizer

pipelineModel

Cross Validator
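The cross-validation box in this diagram relies on splitting the data into folds. The sketch below shows only that splitting step. The function name and the row counts are assumptions for illustration, and no shuffling is applied.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; every row lands in exactly one test fold."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_rows=7, k=3))
```

For hyperparameter tuning, each candidate setting would be trained on every `train_idx` and scored on the matching `test_idx`, with the setting that has the best average score kept.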

Page 27: Image Classification and Retrieval on Spark

https://github.com/gianvi

Thanks!