sdforum 11-04-2010
Apache Mahout
Thursday, November 4, 2010
Apache Mahout
Now with extra whitening and classification powers!
• Mahout intro
• Scalability in general
• Supervised learning recap
• The new SGD classifiers
Mahout?
• Hebrew for “essence”
• Hindi for a guy who drives an elephant
Mahout!
• Scalable data-mining and recommendations
• Not all data-mining
• Not the fanciest data-mining
• Just some of the scalable stuff
• Not a competitor for R or Weka
General Areas
• Recommendations
• lots of support, lots of flexibility, production ready
• Unsupervised learning (clustering)
• lots of options, lots of flexibility, production ready (ish)
General Areas
• Supervised learning (classification)
• multiple architectures, fair number of options, somewhat inter-operable
• production ready (for the right definition of production and ready)
• Large scale SVD
• larger scale coming, beware sharp edges
Scalable?
• Scalable means
• Time is proportional to problem size divided by resource size
• Does not imply Hadoop or parallel
t ∝ |P| / |R|
[Figure: wall-clock time vs. # of training examples. Traditional data mining works at small scale; beyond that, scalable solutions are required. The non-scalable algorithm's time blows up, while the scalable algorithm stays roughly linear (Mahout wins!).]
Scalable means ...
• One unit of work requires about a unit of time
• Not like the company store (bit.ly/22XVa4)
t ∝ |P| / |R|
|P| = O(1) ⟹ t = O(1)
[Figure: wall-clock time vs. # of training examples. Below the crossover the sequential algorithm is preferred; above it the parallel algorithm is preferred.]
Toy Example
Training Data Sample

Filled?   x coordinate   y coordinate   shape
yes       0.92           0.01           circle
no        0.30           0.41           square

Filled? is the target variable; x coordinate, y coordinate, and shape are the predictor variables.
What matters most?
SGD Classification
• Supervised learning of logistic regression
• Sequential gradient descent, not parallel
• Highly optimized for high dimensional sparse data, possibly with interactions
• Scalable, real dang fast to train
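The update behind these classifiers is plain stochastic gradient descent on the logistic-regression likelihood: for each example, nudge the weights by the prediction error times the features. A minimal self-contained sketch of that idea (the class name, fixed learning rate, and toy data are all illustrative; Mahout's OnlineLogisticRegression works on sparse Vector instances and anneals its learning rate):

```java
// Logistic regression trained by sequential (per-example) gradient descent --
// the core idea behind Mahout's SGD classifiers. Everything here (names,
// fixed learning rate, toy data) is an illustrative sketch, not Mahout code.
class SgdSketch {
    static final double LEARNING_RATE = 0.5;  // Mahout anneals this; fixed here

    final double[] beta;  // beta[0] is the intercept

    SgdSketch(int numFeatures) {
        beta = new double[numFeatures + 1];
    }

    // p(target = 1 | x) under the current weights
    double classifyScalar(double[] x) {
        double s = beta[0];
        for (int i = 0; i < x.length; i++) {
            s += beta[i + 1] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-s));
    }

    // One SGD step on one example: beta += eta * (y - p) * x
    void train(int actual, double[] x) {
        double err = actual - classifyScalar(x);
        beta[0] += LEARNING_RATE * err;
        for (int i = 0; i < x.length; i++) {
            beta[i + 1] += LEARNING_RATE * err * x[i];
        }
    }

    public static void main(String[] args) {
        // Toy 1-D data: label is 1 when x is large, 0 when x is small.
        double[][] xs = {{0.1}, {0.2}, {0.3}, {0.7}, {0.8}, {0.9}};
        int[] ys = {0, 0, 0, 1, 1, 1};
        SgdSketch model = new SgdSketch(1);
        for (int pass = 0; pass < 200; pass++) {
            for (int i = 0; i < xs.length; i++) {
                model.train(ys[i], xs[i]);
            }
        }
        System.out.println(model.classifyScalar(new double[]{0.1}));  // well below 0.5
        System.out.println(model.classifyScalar(new double[]{0.9}));  // well above 0.5
    }
}
```

Because each step touches one example, the pass over the data is inherently sequential, but each step is tiny, which is why training is so fast.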
Supervised Learning

[Diagram: labeled training examples (target T, features x1 ... xn) feed the learning algorithm, which produces a model; unlabeled examples (?, features x1 ... xn) are then run through the model to predict the missing targets. Training is sequential but fast; applying the model is stateless and parallel.]
Small example
• On 20 newsgroups
• converges in < 10,000 training examples (less than one pass through the data)
• accuracy comparable to SVM, Naive Bayes, Complementary Naive Bayes
• learning rate, regularization set automagically on held-out data
System Structure

AdaptiveLogisticRegression
    EvolutionaryProcess ep
    void train(target, features)
        | 1 holds 20
        v
CrossFoldLearner
    OnlineLogisticRegression folds
    void train(target, tracking, features)
    double auc()
        | 1 holds 5
        v
OnlineLogisticRegression
    Matrix beta
    void train(target, features)
    double classifyScalar(features)
Training API
public interface OnlineLearner {
  void train(int actual, Vector instance);
  void train(long trackingKey, int actual, Vector instance);
  void train(long trackingKey, String groupKey, int actual, Vector instance);
  void close();
}
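To make the shape of this interface concrete without pulling in Mahout, here is a toy learner with the same train/close lifecycle; Mahout's Vector is replaced by double[] and the majority-class "model" is purely a stand-in to keep the sketch self-contained:

```java
// Same train/close lifecycle as Mahout's OnlineLearner, sketched with
// double[] in place of Mahout's Vector. The majority-class "model" is a
// stand-in to keep the example self-contained, not a real Mahout learner.
interface OnlineLearnerSketch {
    void train(int actual, double[] instance);
    void train(long trackingKey, int actual, double[] instance);
    void train(long trackingKey, String groupKey, int actual, double[] instance);
    void close();
}

class MajorityClassLearner implements OnlineLearnerSketch {
    private final int[] counts;
    private boolean closed = false;

    MajorityClassLearner(int numCategories) {
        counts = new int[numCategories];
    }

    // Each call sees exactly one training example, in stream order.
    public void train(int actual, double[] instance) {
        if (closed) {
            throw new IllegalStateException("learner is closed");
        }
        counts[actual]++;
    }

    // trackingKey ties an example to held-out folds; ignored in this sketch.
    public void train(long trackingKey, int actual, double[] instance) {
        train(actual, instance);
    }

    // groupKey keeps related examples together; also ignored in this sketch.
    public void train(long trackingKey, String groupKey, int actual, double[] instance) {
        train(actual, instance);
    }

    // close() marks the end of training; the model can then be frozen.
    public void close() {
        closed = true;
    }

    int predict() {
        int best = 0;
        for (int i = 1; i < counts.length; i++) {
            if (counts[i] > counts[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MajorityClassLearner learner = new MajorityClassLearner(2);
        learner.train(0, new double[]{0.92, 0.01});
        learner.train(1L, 1, new double[]{0.30, 0.41});
        learner.train(2L, "groupA", 1, new double[]{0.35, 0.44});
        learner.close();
        System.out.println("predicted class: " + learner.predict());
    }
}
```

The streaming shape is the point: the learner never sees the whole data set at once, so memory stays constant no matter how many examples flow through.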
Classification API

public class AdaptiveLogisticRegression implements OnlineLearner {
  public AdaptiveLogisticRegression(int numCategories, int numFeatures,
                                    PriorFunction prior);
  public void train(int actual, Vector instance);
  public void train(long trackingKey, int actual, Vector instance);
  public void train(long trackingKey, String groupKey, int actual, Vector instance);
  public void close();
  public double auc();
  public State<Wrapper> getBest();
}

CrossFoldLearner model =
    learningAlgorithm.getBest().getPayload().getLearner();
double averageCorrect = model.percentCorrect();
double averageLL = model.logLikelihood();
double p = model.classifyScalar(features);
Speed?
• Encoding API for hashed feature vectors
• String, byte[] or double interfaces
• String allows simple parsing
• byte[] and double allow speed
• Abstract interactions supported
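The idea behind the hashed encoding is that feature names are hashed straight into slots of a fixed-width vector, so no dictionary pass is needed and encoding stays cheap. A self-contained sketch of that approach (the class name and the use of String.hashCode() are illustrative; Mahout's actual encoders use stronger hash functions and multiple probes):

```java
// Feature hashing sketch: each named feature is hashed into a slot of a
// fixed-width vector, so encoding needs no dictionary. This mirrors the idea
// behind Mahout's encoders, not their actual code; the class name and use of
// String.hashCode() here are illustrative assumptions.
class HashedEncoderSketch {
    private final int numFeatures;

    HashedEncoderSketch(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    // Add one named feature with the given weight into the vector.
    // Collisions simply add weights together, which SGD tolerates well.
    void addToVector(String name, double weight, double[] vector) {
        int slot = Math.floorMod(name.hashCode(), numFeatures);
        vector[slot] += weight;
    }

    public static void main(String[] args) {
        HashedEncoderSketch enc = new HashedEncoderSketch(20);
        double[] v = new double[20];
        // A categorical feature and a word feature, hashed into the same vector.
        enc.addToVector("shape=circle", 1.0, v);
        enc.addToVector("word=filled", 1.0, v);
        double sum = 0;
        for (double x : v) {
            sum += x;
        }
        System.out.println(sum);  // total mass equals the weights added: 2.0
    }
}
```

Hashing is also what makes the byte[] and double paths fast: encoding is a hash plus an array write, with no string table lookups.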
Speed!
• Parsing and encoding dominate single learner
• Moderate optimization allows 1 million training examples with 200 features to be encoded in 14 seconds on a single core
• 20 million mixed text, categorical features with many interactions learned in ~ 1 hour
More Speed!
• Evolutionary optimization of learning parameters allows simple operation
• 20x threading allows high machine use
• 20 newsgroup test completes in less time on a single node with SGD than on Hadoop with Complementary Naive Bayes
Summary
• Mahout provides early production quality scalable data-mining
• New classification systems allow industrial scale classification
Contact Info

Ted Dunning
tdunning@maprtech.com
or tdunning@apache.com