sdforum 11-04-2010
Apache Mahout
Thursday, November 4, 2010
Apache Mahout
Now with extra whitening and classification powers!
• Mahout intro
• Scalability in general
• Supervised learning recap
• The new SGD classifiers
Mahout?
• Hebrew for “essence”
• Hindi for a guy who drives an elephant
Mahout!
• Scalable data-mining and recommendations
• Not all data-mining
• Not the fanciest data-mining
• Just some of the scalable stuff
• Not a competitor for R or Weka
General Areas
• Recommendations
• lots of support, lots of flexibility, production ready
• Unsupervised learning (clustering)
• lots of options, lots of flexibility, production ready (ish)
General Areas
• Supervised learning (classification)
• multiple architectures, fair number of options, somewhat inter-operable
• production ready (for the right definition of production and ready)
• Large scale SVD
• larger scale coming, beware sharp edges
Scalable?
• Scalable means
• Time is proportional to problem size divided by resource size
• Does not imply Hadoop or parallel
t ∝ |P| / |R|
[Figure: wall-clock time vs. # of training examples. Traditional data mining works at small scale; beyond that, scalable solutions are required. The non-scalable algorithm's time blows up, while the scalable algorithm stays roughly linear (Mahout wins!).]
Scalable means ...
• One unit of work requires about a unit of time
• Not like the company store (bit.ly/22XVa4)
t ∝ |P| / |R|
|P| = O(1) ⟹ t = O(1)
[Figure: wall-clock time vs. # of training examples. Below the crossover the sequential algorithm is preferred; above it the parallel algorithm is preferred.]
Toy Example
Training Data Sample

Filled?   x coordinate   y coordinate   shape
yes       0.92           0.01           circle
no        0.30           0.41           square

Filled? is the target variable; x coordinate, y coordinate, and shape are the predictor variables.
What matters most?
SGD Classification
• Supervised learning of logistic regression
• Sequential gradient descent, not parallel
• Highly optimized for high dimensional sparse data, possibly with interactions
• Scalable, real dang fast to train
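The update behind these classifiers is plain stochastic gradient descent on the logistic-regression likelihood: for each example, nudge the weights by the prediction error times the features. A minimal self-contained sketch of that idea (the class name, fixed learning rate, and toy data are all illustrative; Mahout's OnlineLogisticRegression works on sparse Vector instances and anneals its learning rate):

```java
// Logistic regression trained by sequential (per-example) gradient descent --
// the core idea behind Mahout's SGD classifiers. Everything here (names,
// fixed learning rate, toy data) is an illustrative sketch, not Mahout code.
class SgdSketch {
    static final double LEARNING_RATE = 0.5;  // Mahout anneals this; fixed here

    final double[] beta;  // beta[0] is the intercept

    SgdSketch(int numFeatures) {
        beta = new double[numFeatures + 1];
    }

    // p(target = 1 | x) under the current weights
    double classifyScalar(double[] x) {
        double s = beta[0];
        for (int i = 0; i < x.length; i++) {
            s += beta[i + 1] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-s));
    }

    // One SGD step on one example: beta += eta * (y - p) * x
    void train(int actual, double[] x) {
        double err = actual - classifyScalar(x);
        beta[0] += LEARNING_RATE * err;
        for (int i = 0; i < x.length; i++) {
            beta[i + 1] += LEARNING_RATE * err * x[i];
        }
    }

    public static void main(String[] args) {
        // Toy 1-D data: label is 1 when x is large, 0 when x is small.
        double[][] xs = {{0.1}, {0.2}, {0.3}, {0.7}, {0.8}, {0.9}};
        int[] ys = {0, 0, 0, 1, 1, 1};
        SgdSketch model = new SgdSketch(1);
        for (int pass = 0; pass < 200; pass++) {
            for (int i = 0; i < xs.length; i++) {
                model.train(ys[i], xs[i]);
            }
        }
        System.out.println(model.classifyScalar(new double[]{0.1}));  // well below 0.5
        System.out.println(model.classifyScalar(new double[]{0.9}));  // well above 0.5
    }
}
```

Because each step touches one example, the pass over the data is inherently sequential, but each step is tiny, which is why training is so fast.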
Supervised Learning

[Diagram: labeled training examples (target T, features x1 ... xn) feed the learning algorithm, which produces a model; unlabeled examples (?, features x1 ... xn) are then run through the model to predict the missing targets. Training is sequential but fast; applying the model is stateless and parallel.]
Small example
• On 20 newsgroups
• converges in < 10,000 training examples (less than one pass through the data)
• accuracy comparable to SVM, Naive Bayes, Complementary Naive Bayes
• learning rate, regularization set automagically on held-out data
System Structure

AdaptiveLogisticRegression
    EvolutionaryProcess ep
    void train(target, features)
        | 1 holds 20
        v
CrossFoldLearner
    OnlineLogisticRegression folds
    void train(target, tracking, features)
    double auc()
        | 1 holds 5
        v
OnlineLogisticRegression
    Matrix beta
    void train(target, features)
    double classifyScalar(features)
Training API
public interface OnlineLearner {
  void train(int actual, Vector instance);
  void train(long trackingKey, int actual, Vector instance);
  void train(long trackingKey, String groupKey, int actual, Vector instance);
  void close();
}
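To make the shape of this interface concrete without pulling in Mahout, here is a toy learner with the same train/close lifecycle; Mahout's Vector is replaced by double[] and the majority-class "model" is purely a stand-in to keep the sketch self-contained:

```java
// Same train/close lifecycle as Mahout's OnlineLearner, sketched with
// double[] in place of Mahout's Vector. The majority-class "model" is a
// stand-in to keep the example self-contained, not a real Mahout learner.
interface OnlineLearnerSketch {
    void train(int actual, double[] instance);
    void train(long trackingKey, int actual, double[] instance);
    void train(long trackingKey, String groupKey, int actual, double[] instance);
    void close();
}

class MajorityClassLearner implements OnlineLearnerSketch {
    private final int[] counts;
    private boolean closed = false;

    MajorityClassLearner(int numCategories) {
        counts = new int[numCategories];
    }

    // Each call sees exactly one training example, in stream order.
    public void train(int actual, double[] instance) {
        if (closed) {
            throw new IllegalStateException("learner is closed");
        }
        counts[actual]++;
    }

    // trackingKey ties an example to held-out folds; ignored in this sketch.
    public void train(long trackingKey, int actual, double[] instance) {
        train(actual, instance);
    }

    // groupKey keeps related examples together; also ignored in this sketch.
    public void train(long trackingKey, String groupKey, int actual, double[] instance) {
        train(actual, instance);
    }

    // close() marks the end of training; the model can then be frozen.
    public void close() {
        closed = true;
    }

    int predict() {
        int best = 0;
        for (int i = 1; i < counts.length; i++) {
            if (counts[i] > counts[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MajorityClassLearner learner = new MajorityClassLearner(2);
        learner.train(0, new double[]{0.92, 0.01});
        learner.train(1L, 1, new double[]{0.30, 0.41});
        learner.train(2L, "groupA", 1, new double[]{0.35, 0.44});
        learner.close();
        System.out.println("predicted class: " + learner.predict());
    }
}
```

The streaming shape is the point: the learner never sees the whole data set at once, so memory stays constant no matter how many examples flow through.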
Classification API

public class AdaptiveLogisticRegression implements OnlineLearner {
  public AdaptiveLogisticRegression(int numCategories, int numFeatures,
                                    PriorFunction prior);
  public void train(int actual, Vector instance);
  public void train(long trackingKey, int actual, Vector instance);
  public void train(long trackingKey, String groupKey, int actual, Vector instance);
  public void close();
  public double auc();
  public State<Wrapper> getBest();
}

CrossFoldLearner model =
    learningAlgorithm.getBest().getPayload().getLearner();
double averageCorrect = model.percentCorrect();
double averageLL = model.logLikelihood();
double p = model.classifyScalar(features);
Speed?
• Encoding API for hashed feature vectors
• String, byte[] or double interfaces
• String allows simple parsing
• byte[] and double allow speed
• Abstract interactions supported
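The idea behind the hashed encoding is that feature names are hashed straight into slots of a fixed-width vector, so no dictionary pass is needed and encoding stays cheap. A self-contained sketch of that approach (the class name and the use of String.hashCode() are illustrative; Mahout's actual encoders use stronger hash functions and multiple probes):

```java
// Feature hashing sketch: each named feature is hashed into a slot of a
// fixed-width vector, so encoding needs no dictionary. This mirrors the idea
// behind Mahout's encoders, not their actual code; the class name and use of
// String.hashCode() here are illustrative assumptions.
class HashedEncoderSketch {
    private final int numFeatures;

    HashedEncoderSketch(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    // Add one named feature with the given weight into the vector.
    // Collisions simply add weights together, which SGD tolerates well.
    void addToVector(String name, double weight, double[] vector) {
        int slot = Math.floorMod(name.hashCode(), numFeatures);
        vector[slot] += weight;
    }

    public static void main(String[] args) {
        HashedEncoderSketch enc = new HashedEncoderSketch(20);
        double[] v = new double[20];
        // A categorical feature and a word feature, hashed into the same vector.
        enc.addToVector("shape=circle", 1.0, v);
        enc.addToVector("word=filled", 1.0, v);
        double sum = 0;
        for (double x : v) {
            sum += x;
        }
        System.out.println(sum);  // total mass equals the weights added: 2.0
    }
}
```

Hashing is also what makes the byte[] and double paths fast: encoding is a hash plus an array write, with no string table lookups.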
Speed!
• Parsing and encoding dominate single learner
• Moderate optimization allows 1 million training examples with 200 features to be encoded in 14 seconds on a single core
• 20 million mixed text, categorical features with many interactions learned in ~ 1 hour
More Speed!
• Evolutionary optimization of learning parameters allows simple operation
• 20x threading allows high machine use
• 20 newsgroup test completes in less time on a single node with SGD than on Hadoop with Complementary Naive Bayes
Summary
• Mahout provides early production quality scalable data-mining
• New classification systems allow industrial scale classification
Contact Info

Ted Dunning
tdunning@maprtech.com
or tdunning@apache.com