Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems


Page 1: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Victor Herrera, Maryam Najafabadi

Optimizing Supervised Machine Learning Algorithms and Implementing Deep Learning in HPCC Systems

Page 2: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems


LexisNexis / Florida Atlantic University Cooperative Research: developing ML algorithms on the HPCC/ECL platform.

[Diagram: the LexisNexis HPCC Platform (open source, big data management, optimized code, parallel processing) and the ECL ML Library (a high-level, data-centric declarative language; scalability; dictionary approach) together support the Machine Learning Algorithms work.]

Page 3: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Agenda

• Optimizing Supervised Methods (Victor Herrera)

• Toward Deep Learning (Maryam Najafabadi)


Page 4: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Optimizing Supervised Methods


Page 5: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

ML-ECL Random Forest Optimization:
• Significantly decreased the time of the Learning and Classification phases.
• Improved classification performance.

Working with Sparse Data:
• Sparse ARFF reduces the dataset representation.
• Speeds up Naïve Bayes learning and classification time on highly sparse datasets.

Overview


Page 6: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Random Forest (Breiman, 2001): an ensemble supervised learning algorithm for classification and regression that operates by constructing a multitude of decision trees.

Main idea: most of the trees are good for most of the data and make their mistakes in different places.

How: decision-tree bagging (random samples with replacement), splits over a random selection of features, and majority voting.

Why RF: it overcomes the overfitting problem; handles wide, unbalanced-class, and noisy data; generally outperforms single algorithms; and is well suited to parallelization. A minimal sketch of the recipe follows below.
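To make the recipe concrete, here is a minimal Python sketch of bagging with majority voting. It uses scikit-learn's DecisionTreeClassifier and a per-tree feature subset for brevity; the ECL-ML implementation described later randomizes feature selection per node, and none of the names below come from it.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_trees=25, n_feats=6, seed=0):
    """Bagging: each tree sees a bootstrap sample (random rows, with
    replacement) and a random subset of the features."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                  # sample with replacement
        feats = rng.choice(m, size=n_feats, replace=False) # random feature subset
        tree = DecisionTreeClassifier().fit(X[rows][:, feats], y[rows])
        forest.append((tree, feats))
    return forest

def predict_forest(forest, X):
    """Majority voting across the trees (labels assumed integer-coded)."""
    votes = np.stack([t.predict(X[:, f]) for t, f in forest]).astype(int)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```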



Page 7: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

[Diagram: Decision Tree learning process. Training Data (Independent and Dependent datasets) is transformed into the Decision Tree record layout; all instances are assigned to the root node; the iterative Split/Partition process (controlled by purity and maximum tree level) runs; and the result is transformed into the Decision Tree model format.]

The underlying recursion, GrowTree(Training Data), starting with the training data in the ROOT:
1. Calculate the node's purity. If the node is pure enough, return a LEAF node with its label.
2. Otherwise, find the best attribute to split on and split the training data into subsets Di.
3. For each subset Di: Children += GrowTree(Di).
4. Return a SPLIT node plus its Children.

• Random Forest learning is based on recursive partitioning, as in Decision Trees.

• Forward references are not allowed in ECL, so the recursion cannot be expressed directly.

• Decision Tree learning is therefore implemented in ECL as an iterative process via LOOP(dataset, …, loopbody); a sketch of the recursive-to-iterative rewrite follows below.

Recursive Partitioning as Iterative in ECL
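The rewrite can be illustrated outside ECL as well. Below is a minimal Python sketch (names are illustrative, not the ECL-ML identifiers): the recursion becomes a worklist of frontier nodes, and each pass of the while loop plays the role of one LOOP iteration over the node-instance dataset.

```python
import numpy as np

def gini(y):
    _, c = np.unique(y, return_counts=True)
    p = c / c.sum()
    return 1.0 - np.sum(p * p)

def best_split(X, y):
    """Exhaustive search for the (feature, threshold) with lowest weighted Gini."""
    best = (np.inf, 0, 0.0)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f])[:-1]:
            m = X[:, f] <= thr
            w = (m.sum() * gini(y[m]) + (~m).sum() * gini(y[~m])) / len(y)
            if w < best[0]:
                best = (w, f, thr)
    return best[1], best[2]

def grow_tree_iterative(X, y, max_level=10, min_purity=1.0):
    node_of = np.zeros(len(y), dtype=int)   # one record per (node, instance)
    level = {0: 0}                          # frontier: node id -> depth
    model, next_id = [], 1
    while level:                            # one pass == one LOOP iteration
        nxt = {}
        for node, lvl in level.items():
            idx = np.where(node_of == node)[0]
            if idx.size == 0:
                continue
            labels, counts = np.unique(y[idx], return_counts=True)
            if counts.max() / counts.sum() >= min_purity or lvl >= max_level:
                model.append(("LEAF", node, labels[counts.argmax()]))
                continue
            f, thr = best_split(X[idx], y[idx])
            model.append(("SPLIT", node, f, thr))
            left = idx[X[idx, f] <= thr]
            node_of[left], node_of[np.setdiff1d(idx, left)] = next_id, next_id + 1
            nxt[next_id], nxt[next_id + 1] = lvl + 1, lvl + 1
            next_id += 2
        level = nxt
    return model
```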


Page 8: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Random Forest Learning Optimization

Initial Implementation Flaws:

• For every single iteration of the iterative Split/Partition LOOP, at least K x N x M records (trees x instances x features) are sent to the loopbody function.
• On each LOOP iteration, every Node-Instance record passes to the loopbody function regardless of whether its processing was already completed.
• Resources are wasted by including the Independent data in the loopbody function's INPUT:
  - Node purity is based only upon the Dependent data.
  - Finding the best split per node needs only subsets of the Independent data (feature selection).
• The implementation was not fully parallelized.

Rnd Forest Initial Implementation

[Diagram: Training Data (Independent and Dependent datasets) enters Bootstrap (sampling), which associates original to new instance IDs (Orig-New ID instances hash table) and assigns instances to root nodes (K = number of trees). Forest Growth then feeds the full training data in the root nodes to the Iterative Split/Partition process (S = number of features to select), producing the RF model. Dataset size entering the loop: K x N x M records.]


Page 9: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Rnd Forest Optimized Implementation

[Diagram: in Bootstrap, only the Dependent training data is sampled (K = number of trees), building the Original-New instance ID hash table and placing the sampled Dependent data in the root nodes (dataset size: K x N records); the Independent training data is sampled separately into a Sampled Independent dataset (dataset size: K x M records). In Forest Growth, the Iterative Split/Partition process (controlled by purity, max tree level, and the number of features to select) fetches only the required sampled Independent data at each iteration and produces the RF model.]

Random Forest Learning Optimization

A review of the initial implementation helped us re-organize the process and data flows. We improved the initial approach to:

• Filter out records not requiring further processing (LOOP rowfilter).
• Pass only one RECORD per instance (its dependent value) into the loopbody function.
• Fetch only the required Independent data from within the function at each iteration.
• Take full advantage of the distributed data storage and parallel processing capabilities of the HPCC Systems platform.


Page 10: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Random Forest Learning Optimization

The loopbody function is now fully parallelized:

• It receives and returns one RECORD per instance.
• Node impurity and best-split-per-node calculations are done LOCAL-ly:
  - Node-Instance data is DISTRIBUTEd by node_id.
• The random feature selection is fetched with a JOIN, LOCAL:
  - The sampled Independent data is generated and DISTRIBUTEd by instance id at BOOTSTRAP.
  - The dataset of selected instance-feature combinations (RETRIEVER) is DISTRIBUTEd by instance id.
• Instance relocation to new nodes is done LOCAL-ly:
  - Impure Node-Instance data is still DISTRIBUTEd by node_id.
  - JOIN, LOOKUP with the split-nodes data.


RF Split/Partition loopbody FUNCTION Optimized

[Diagram: the INPUT Node-Instance data first has its node Gini impurity calculated; nodes are then filtered by impurity. Node-Instance data that is pure enough goes straight to the OUTPUT. For impure Node-Instance data, the random feature selection is fetched from the Sampled Independent dataset, the best split per node is chosen (Split Nodes data), and instances are re-assigned to new nodes; the new Node-Instance assignments complete the OUTPUT.]
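As a concrete reference for the impurity step in the diagram, here is a minimal Python sketch of per-node Gini impurity over one-record-per-instance data (field names are illustrative, not the ECL-ML record layout):

```python
import numpy as np

def node_gini(node_id, dep):
    """Gini impurity per node, given parallel arrays of node ids and
    dependent (class) values: gini = 1 - sum_c p_c^2."""
    out = {}
    for node in np.unique(node_id):
        _, counts = np.unique(dep[node_id == node], return_counts=True)
        p = counts / counts.sum()
        out[node] = 1.0 - np.sum(p * p)
    return out

# Example: node 1 is pure (gini 0.0), node 2 is mixed (gini 0.5).
print(node_gini(np.array([1, 1, 2, 2]), np.array([0, 0, 0, 1])))
```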

Page 11: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Random Forest Learning Optimization – Preliminary Results

Preliminary comparison of learning time between the initial version (old) and the optimized beta version (new) on the Adult dataset:

• Discrete dataset: 16,281 instances x 13 features + class, balanced.
• 6 features selected (half of the total).
• Number of trees: 25, 50, 75 and 100. Depth: 10, 25, 50, 75, 100 and 125.
• 10 runs for each case.

The preliminary results gave us the green light to complete the final optimized implementation:
• Fully parallelized learning process.
• New optimized classification process.


Page 12: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Working with Sparse Data – NaïveBayes

A sparse matrix is a matrix in which most elements are zero. One way to reduce its dataset representation is to use the Sparse ARFF file format.

//ARFF file               | //Sparse ARFF file        | //Sparse Types.DiscreteField DS
                          | attr index starts at 0    | attr index starts at 1
@data                     | @data                     | //defValue := 0, posclass := 1, negclass := 0
0, 0, 1, 0, 0, posclass   | {2 1, 5 posclass}         | [{1, 3, 1}, {1, 6, 1},
2, 0, 0, 0, 1, posclass   | {0 2, 4 1, 5 posclass}    |  {2, 1, 2}, {2, 5, 1}, {2, 6, 1},
0, 0, 0, 0, 2, negclass   | {4 2, 5 negclass}         |  {3, 5, 2}, {3, 6, 0}]

Instead of using 15 records to represent the data, only 7 were enough.

NaiveBayes using Sparse ARFF:
• Highly sparse datasets, as in Text Mining Bag-of-Words, are represented with a few records in ECL.
• Saves disk/memory space.
• Extends the default value "0" to any value defined by DefValue := value;
• Accelerates calculations via pre-computation of the DefValue frequencies, speeding up both learning and classification time; a sketch of this idea follows below.
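A minimal Python sketch of the DefValue pre-computation idea (the record layout and names are illustrative, not the ECL-ML SparseNaiveBayes internals): the class-conditional counts that Naive Bayes needs are built from the stored non-default entries alone, and the default-value counts are recovered arithmetically instead of materializing the dense matrix.

```python
from collections import Counter

def sparse_nb_counts(sparse_rows, labels, n_features, def_value=0):
    """sparse_rows: list of dicts {feature_index: value}, defaults omitted.
    Returns (class, feature, value) -> count without densifying the data."""
    class_n = Counter(labels)        # instances per class
    counts = {}                      # (class, feat, value) -> count
    stored = Counter()               # (class, feat) -> number of stored entries
    for row, c in zip(sparse_rows, labels):
        for f, v in row.items():
            counts[(c, f, v)] = counts.get((c, f, v), 0) + 1
            stored[(c, f)] += 1
    # default-value frequency = class size minus stored entries for that feature
    for c in class_n:
        for f in range(n_features):
            counts[(c, f, def_value)] = class_n[c] - stored[(c, f)]
    return counts

# Example with the 3-instance dataset above (features indexed from 0):
rows = [{2: 1}, {0: 2, 4: 1}, {4: 2}]
print(sparse_nb_counts(rows, ["pos", "pos", "neg"], n_features=5))
```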


Page 13: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Sparse Naïve Bayes – Results

Sentiment dataset (Bag of Words):
• 1.6 million instances x 109,735 features + class.
• 175.57 billion DiscreteField records in dense form.

Original NaiveBayes classification using sub-samples of the Sentiment dataset:
• 5%: job completed, on average in a little more than 1 hour.
• 20%: job completed, on average in around 4.5 hours.
• 50%: job completed, on average in around 12.5 hours.

Sparse ARFF format Sentiment dataset:
• 1.6e+6 lines, between 1 and 30 non-default values per line: very high sparsity.
• Assuming 15 values on average: 1.6e+6 x 15 = 24 million DiscreteField records.
• Default value "0".

SparseNaïveBayes using the equivalent Sparse ARFF Sentiment dataset:
• Classification test done in just 70 seconds.
• A 10-fold cross-validation run takes only 6 minutes to finish.


Page 14: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Summary

ML-ECL Random Forest Speed-Up:

Learning processing time reduction:
• Reduction of read/write operations through data-passing simplification.
• Parallelization of the loopbody function:
  - Reorganization of data distribution and aggregations.
  - Fetching only the required Independent data.

Classification processing time reduction:
• Implemented as an iteration and fully parallelized.

Classification performance improvement:
• Feature-selection randomization upgraded to node level.


Page 15: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Summary

Working with Sparse Data:

Functionality to work with Sparse ARFF format files in HPCC:
• Sparse-ARFF to DiscreteField conversion function implemented.

Sparse Naïve Bayes Discrete classifier implemented:
• Learning and classification phases fully operative.
• Highly sparse big datasets processed in seconds.


Page 16: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Toward Deep Learning


Page 17: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Overview

• Optimization algorithms on HPCC Systems

• Implementations based on the optimization algorithm


Page 18: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Mathematical optimization

• Minimizing/maximizing a function

[Figure: a curve with its minimum marked.]


Page 19: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Optimization Algorithms in Machine Learning

• The heart of many (most practical?) machine learning algorithms:
  - Linear regression: minimize the errors

[Figure: a regression line fitted to minimize the errors to the data points.]


Page 20: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Optimization Algorithms in Machine Learning

• SVM: maximize the margin

[Figure: an SVM separating hyperplane with maximal margin.]


Page 21: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Optimization Algorithms in Machine Learning

• Collaborative filtering

• K-means

• Maximum likelihood estimation

• Graphical models

• Neural networks

• Deep Learning


Page 22: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Formulate Training as an Optimization Problem

• Training a model means finding parameters that minimize some objective function:

Define the parameters → define an objective function → find values for the parameters that minimize the objective function, using an optimization algorithm.

The objective function typically combines a cost term and a regularization term; a generic form is written out below.
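As a concrete reference (the slide names the two terms; the specific λ-weighted form below is the standard convention, assumed here rather than taken from the slides):

$$J(\theta) \;=\; \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\big(f_\theta(x_i),\, y_i\big)}_{\text{cost term}} \;+\; \underbrace{\lambda\, R(\theta)}_{\text{regularization term}}$$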


Page 23: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

How they work

• Search direction
• Step length

[Figure: an iterate moving along a search direction by a chosen step length.]


Page 24: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Gradient Descent

• Step length: constant value
• Search direction: negative gradient
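A minimal Python sketch matching the slide: a constant step length and the negative gradient as the search direction (the quadratic objective in the example is just an illustrative choice):

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x -= step * grad(x)        # move along the negative gradient
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), x0=[0.0]))   # ~ [3.]
```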


Page 25: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Gradient Descent

• Step length: constant value
• Search direction: negative gradient

[Figure: with a small step length, the iterates take many small steps toward the minimum.]


Page 26: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Gradient Descent

• Step length: constant value
• Search direction: negative gradient

[Figure: with a large step length, the iterates can overshoot the minimum.]



Page 28: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

L-BFGS

• Step length: Wolfe line search
• Search direction: a simple and compact representation of the Hessian matrix
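For reference, here is a minimal Python sketch of the textbook two-loop recursion (Nocedal & Wright) that gives L-BFGS its compact Hessian representation; it illustrates the idea and is not the HPCC implementation:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: approximates -H^{-1} grad using the stored
    curvature pairs s_i = x_{i+1} - x_i and y_i = g_{i+1} - g_i."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
        rho = 1.0 / np.dot(y, s)
        a = rho * np.dot(s, q)
        alphas.append((a, rho))
        q -= a * y
    if s_list:   # scale by gamma = s'y / y'y as the initial Hessian guess
        s, y = s_list[-1], y_list[-1]
        q *= np.dot(s, y) / np.dot(y, y)
    for (a, rho), (s, y) in zip(reversed(alphas), zip(s_list, y_list)):
        b = rho * np.dot(y, q)
        q += (a - b) * s
    return -q    # descent direction
```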


Page 29: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

L-BFGS

• Limited memory: keeps only a few vectors of length n (instead of an n-by-n matrix)

• Useful for solving large problems (large n)

• More stable learning

• Uses curvature information to take a more direct route, hence faster convergence


Page 30: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

How to use

• Define a function that calculates the objective value and the gradient:

ObjectiveFunc(x, ObjectiveFunc_params, TrainData, TrainLabel)
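The same calling pattern can be tried quickly in SciPy, shown here purely as an illustrative stand-in for the HPCC interface (the ECL-ML API is the ObjectiveFunc signature above, not this one): the user supplies one function returning (objective value, gradient) and hands it to an L-BFGS driver.

```python
import numpy as np
from scipy.optimize import minimize

def objective_func(x, train_data, train_label):
    # least-squares objective and its gradient at point x
    resid = train_data @ x - train_label
    value = 0.5 * np.dot(resid, resid)
    grad = train_data.T @ resid
    return value, grad

X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
y = np.array([1.0, 4.0, 3.0])
res = minimize(objective_func, x0=np.zeros(2), args=(X, y),
               jac=True, method="L-BFGS-B")
print(res.x)   # ~ [1., 2.]
```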


Page 31: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

L-BFGS based Implementations on HPCC Systems

• Sparse Autoencoder

• Softmax


Page 32: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Sparse Autoencoder

• Autoencoder: the output is trained to reproduce the input

• Sparsity: constrain the hidden neurons to be inactive most of the time

• Stacking them up makes a deep network


Page 33: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Formulate to an optimization problem

• Parameters: the weight and bias values

• Objective function: the difference between the output and the expected output, plus a penalty term to impose sparsity

• Define a function to calculate the objective value and gradient at a given point; a sketch follows below
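A compact sketch of such an objective for a one-hidden-layer autoencoder, using the standard KL-divergence sparsity penalty (parameter names are illustrative; the gradients would come from backpropagation and are omitted for brevity, with an L-BFGS driver like the one above consuming value and gradient together):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_ae_cost(W1, b1, W2, b2, X, rho=0.05, beta=3.0, lam=1e-4):
    """X: n_features x n_samples. Objective = reconstruction error
    + weight decay + KL sparsity penalty on the mean hidden activations."""
    n = X.shape[1]
    A1 = sigmoid(W1 @ X + b1[:, None])     # hidden activations
    A2 = sigmoid(W2 @ A1 + b2[:, None])    # reconstruction of the input
    recon = 0.5 * np.sum((A2 - X) ** 2) / n
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    rho_hat = A1.mean(axis=1)              # average activation per hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return recon + decay + beta * kl
```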


Page 34: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Sparse Autoencoder results

• 10,000 randomly selected 8x8 image patches

[Figure: visualization of the learned features.]


Page 35: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Sparse Autoencoder results

• MNIST dataset


Page 36: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

SoftMax Regression

• Generalizes logistic regression

• More than two classes

• MNIST: 10 different classes


Page 37: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Formulate to an optimization problem

• Parameters: K-by-n variables (K classes, n features)

• Objective function: a generalization of the logistic regression objective function

• Define a function to calculate the objective value and gradient at a given point; a sketch follows below
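A minimal sketch of that K-by-n formulation in NumPy (illustrative, with an assumed weight-decay term; not the ECL implementation), returning exactly the (value, gradient) pair an L-BFGS driver consumes:

```python
import numpy as np

def softmax_cost_grad(Theta, X, y, lam=1e-4):
    """Theta: K x n parameters; X: m x n data; y: m integer labels in [0, K).
    Returns the cross-entropy objective and its gradient."""
    m = X.shape[0]
    scores = X @ Theta.T                          # m x K
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)             # class probabilities
    cost = -np.log(P[np.arange(m), y]).mean() + 0.5 * lam * np.sum(Theta ** 2)
    P[np.arange(m), y] -= 1.0                     # dLoss/dScores
    grad = P.T @ X / m + lam * Theta              # K x n
    return cost, grad
```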


Page 38: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

SoftMax Results

• Test on MNIST data

• Using features extracted by Sparse Autoencoder

• 96% accuracy


Page 39: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Toward Deep Learning

• Provide the learned features from one layer as input to another sparse autoencoder…

• …and stack them up to build a deep network

• Fine tuning:
  - Forward propagation calculates the cost value; back propagation calculates the gradients
  - Use L-BFGS to fine-tune; a schematic of the stacked pipeline follows below
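A schematic Python sketch of the stacking idea: train one sparse autoencoder per layer on the previous layer's activations, then feed the top-level features to the softmax classifier. The train_sparse_ae callable is a hypothetical stand-in (e.g., an L-BFGS run on the autoencoder cost sketched earlier); this is a pipeline outline, not the ECL implementation.

```python
import numpy as np

def stack_features(X, layer_sizes, train_sparse_ae):
    """Greedy layer-wise training: each autoencoder learns on the features
    produced by the previous one; returns the stacked encoders and the
    top-level features for the softmax / fine-tuning stage."""
    feats, encoders = X, []
    for size in layer_sizes:
        W, b = train_sparse_ae(feats, n_hidden=size)   # assumed trainer
        feats = 1.0 / (1.0 + np.exp(-(W @ feats + b[:, None])))  # encode
        encoders.append((W, b))
    return encoders, feats
```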


Page 40: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Taking Advantage of HPCC Systems

• PBblas (Parallel Block BLAS) for distributed linear algebra

• Graphs


Page 41: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Example


Page 42: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Example


Page 43: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Example



Page 45: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Summary

• Optimization algorithms are an important aspect of advanced machine learning problems

• L-BFGS implemented on HPCC Systems:
  - SoftMax
  - Sparse Autoencoder

• Other algorithms can be implemented by supplying their objective value and gradient calculations

• Toward deep learning


Page 46: Optimizing Supervised and Implementing Unsupervised Machine Learning Algorithms in HPCC Systems

Questions?

Thank You
