Hivemall Tech Talk at Redwood, CA
TRANSCRIPT
Makoto YUI - Research Engineer, Treasure Data / Twitter: @myui
Hivemall: Scalable Machine Learning Library for Apache Hive
➢ 2015/04 Joined Treasure Data, Inc.
➢ 1st Research Engineer in Treasure Data
➢ My mission in TD is developing ML-as-a-Service (MLaaS)
➢ 2010/04-2015/03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan
➢ Worked on a large-scale Machine Learning project and Parallel Databases
➢ 2009/03 Ph.D. in Computer Science from NAIST
➢ XML native database and Parallel Database systems
➢ Super Programmer Award from the MITOU Foundation (government-funded program for finding young and talented programmers)
➢ Super Creators in Treasure Data: Sada Furuhashi, Keisuke Nishida
2
Who am I?
1. What is Hivemall (very short intro.)
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
3
Agenda
What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
Machine Learning:                    Hivemall
Query Processing:                    Hive / Pig
Parallel Data Processing Framework:  MapReduce (MRv1), Apache Tez (DAG processing, MRv2)
Resource Management:                 Apache YARN
Distributed File System:             Hadoop HDFS
4
5
Won IDG's InfoWorld Bossie Awards 2014: The best open source big data tools
InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem
bit.ly/hivemall-award
List of functions in Hivemall v0.3
• Classification (both binary- and multi-class)
  ✓ Perceptron
  ✓ Passive Aggressive (PA)
  ✓ Confidence Weighted (CW)
  ✓ Adaptive Regularization of Weight Vectors (AROW)
  ✓ Soft Confidence Weighted (SCW)
  ✓ AdaGrad + RDA
• Regression
  ✓ Logistic Regression (SGD)
  ✓ PA Regression
  ✓ AROW Regression
  ✓ AdaGrad
  ✓ AdaDELTA
• kNN and Recommendation
  ✓ MinHash and b-Bit MinHash (LSH variant)
  ✓ Similarity search using k-NN
  ✓ Matrix Factorization
• Feature engineering
  ✓ Feature hashing
  ✓ Feature scaling (normalization, z-score)
  ✓ TF-IDF vectorizer
6
Treasure Data supports Hivemall v0.3 in the HDP2 environment
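The feature hashing listed above can be illustrated with a minimal sketch. This is a hypothetical standalone version in Python, not Hivemall's actual UDF (which is invoked from HiveQL); the function name and bucket count are my own choices.

```python
import hashlib

def hash_feature(feature: str, num_buckets: int = 2**24) -> int:
    """Map an arbitrary feature string into a fixed-size index space
    (the 'hashing trick'). Collisions are possible but rare when
    num_buckets is large; the payoff is a bounded, dense feature
    index regardless of vocabulary size."""
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Sparse text-like features hashed into a fixed index range
features = ["height:173", "weight:60", "age:34"]
hashed = [hash_feature(f) for f in features]
```

The same input always hashes to the same index, so training and prediction agree on feature ids without a shared dictionary.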
1. What is Hivemall (very short intro.)
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
7
Agenda
Why Hivemall
1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python does not scale as well as Hive.
2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.
8
That's why I built Hivemall.
How I used to do ML projects before Hivemall
9
Given raw data stored on Hadoop HDFS
RawData
HDFS / S3    Feature Vector
height: 173cm, weight: 60kg, age: 34, gender: man, …
Extract-Transform-Load
Machine Learning
file
How I used to do ML projects before Hivemall
10
Given raw data stored on Hadoop HDFS
RawData
HDFS / S3    Feature Vector
height: 173cm, weight: 60kg, age: 34, gender: man, …
Extract-Transform-Load
file
Need to do expensive data preprocessing
(Joins, Filtering, and Formatting of Data that does not fit in memory)
Machine Learning
How I used to do ML projects before Hivemall
11
Given raw data stored on Hadoop HDFS
RawData
HDFS / S3    Feature Vector
height: 173cm, weight: 60kg, age: 34, gender: man, …
Extract-Transform-Load
file
Do not scale
Have to learn R/Python APIs
How I used to do ML before Hivemall
12
Given raw data stored on Hadoop HDFS
RawData
HDFS / S3    Feature Vector
height: 173cm, weight: 60kg, age: 34, gender: man, …
Extract-Transform-Load
Does not meet my needs
In terms of its scalability, ML algorithms, and usability
I ❤️ scalable SQL queries
Survey on existing ML frameworks

Framework                            User interface
Mahout                               Java API programming
Spark MLlib/MLI                      Scala API programming / Scala shell (REPL)
H2O                                  R programming / GUI
Cloudera Oryx                        HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)  C++ API programming / command line
Existing distributed machine learning frameworks are NOT easy to use
13
Hivemall’s Vision: ML on SQL
Classification with Mahout: 100+ lines of code. The same in Hivemall:

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight        -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train                   -- map-only task
) t
GROUP BY feature;              -- shuffled to reducers
✓Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
14
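The avg(weight) ... GROUP BY feature step above averages per-feature weights learned by independent map tasks. A toy Python sketch of that model-averaging idea (illustrative only; Hivemall performs this inside Hive reducers, and the function name here is made up):

```python
from collections import defaultdict

def average_models(partial_models):
    """Average per-feature weights produced by independently trained
    partial models, mimicking 'avg(weight) ... GROUP BY feature'.
    partial_models: list of {feature: weight} dicts, one per map task."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    # A feature only seen by some tasks is averaged over those tasks only
    return {f: sums[f] / counts[f] for f in sums}

# Two map tasks learned slightly different weights for the same feature
m1 = {"height": 0.4, "age": -0.2}
m2 = {"height": 0.6}
merged = average_models([m1, m2])
```

Because averaging is commutative and associative per feature, the reducers can run in parallel with no shared state.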
Machine learning algorithms are implemented as User-Defined Table-generating Functions (UDTFs)
How Hivemall works in training
Training table (tuple<label, array<features>>):
  +1, <1,2> … +1, <1,7,9>
  -1, <1,3,9> … +1, <3,8>
      ↓ train (UDTF) tasks run in parallel, with param-mix between them
Shuffle by feature
      ↓
Prediction model: relation <feature, weight>

! Resulting prediction model is a relation of features and their weights
! # of mappers and reducers are configurable
16
UDTF is a function that returns a relation
Parallelism is Powerful

Why not UDAF: machine learning as an aggregate function

Training table (tuple<label, array<features>>):
  +1, <1,2> … +1, <1,7,9>
  -1, <1,3,9> … +1, <3,8>
  -1, <2,7,9> … +1, <3,8>
  -1, <2,7,9> … +1, <3,8>
      ↓ train ×4 (4 ops in parallel)   → array<weight>
      ↓ merge ×2 (2 ops in parallel)   → array<sum of weights>, array<count>
      ↓ final merge (no parallelism)   → prediction model

Bottleneck in the final merge: throughput is limited by its fan-out.
Moving up the aggregation tree, memory consumption grows and parallelism decreases.
17
Problem that I faced: Iterations

Iterations are mandatory to get a good prediction model
• However, MapReduce is not suited for iterations because the IN/OUT of an MR job goes through HDFS
• Spark avoids this by in-memory computation
MapReduce: Input → [HDFS read → iter. 1 → HDFS write] → [HDFS read → iter. 2 → HDFS write] → …
Spark:     Input → iter. 1 → iter. 2 → … (in memory)
18
Training with Iterations in Spark: Logistic Regression example of Spark

val data = spark.textFile(...).map(readPoint).cache()   // each node loads the data into memory once

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)   // repeated MapReduce steps to do gradient descent
  w -= gradient
}

This is just a toy example! Why? The input to the gradient computation should be shuffled for each iteration (without shuffling, more iterations are required).
19
What does MLlib actually do?
Mini-batch Gradient Descent with Sampling (GradientDescent.scala, bit.ly/spark-gd)

val data = ...
for (i <- 1 to numIterations) {
  val sampled = ...   // sample a subset of the data (partitioned RDD)
  val gradient = ...  // average the subgradients over the sampled data using Spark MapReduce
  w -= gradient
}

Iterations are mandatory for convergence because each iteration uses only a small fraction of the data.
20
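The sampling-and-averaging loop described above can be written out as a toy mini-batch gradient descent for logistic regression. This Python sketch is illustrative, not MLlib's implementation; the function signature and data are my own.

```python
import math
import random

def minibatch_sgd(data, dims, iterations=100, batch=4, lr=0.1, seed=42):
    """Toy mini-batch gradient descent for logistic regression.
    data: list of (label, features) with label in {-1, +1} and a dense
    feature list of length dims. Each iteration samples a subset and
    averages the subgradients over it, like GradientDescent.scala."""
    rng = random.Random(seed)
    w = [0.0] * dims
    for _ in range(iterations):
        sampled = rng.sample(data, min(batch, len(data)))  # sample a subset
        grad = [0.0] * dims
        for y, x in sampled:  # average subgradients over the sample
            dot = sum(wi * xi for wi, xi in zip(w, x))
            coeff = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
            for j in range(dims):
                grad[j] += coeff * x[j] / len(sampled)
        for j in range(dims):
            w[j] -= lr * grad[j]
    return w

# Linearly separable toy data: label is the sign of (x0 - x1)
data = [(+1, [2.0, 0.0]), (+1, [1.5, 0.2]), (-1, [0.0, 2.0]), (-1, [0.1, 1.8])]
w = minibatch_sgd(data, dims=2)
```

Because each step sees only a sample, many iterations are needed before every example has influenced the weights, which is exactly the convergence point made above.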
Alternative Approach in Hivemall

Hivemall provides the amplify UDTF to emulate iteration effects in machine learning without several MapReduce steps.

SET hivevar:xtimes=3;

CREATE VIEW training_x3
AS
SELECT *
FROM (
  SELECT amplify(${xtimes}, *) as (rowid, label, features)
  FROM training
) t
CLUSTER BY rand();
21
Map-only shuffling and amplifying

rand_amplify UDTF randomly shuffles the input rows within each map task

CREATE VIEW training_x3
AS
SELECT
  rand_amplify(${xtimes}, ${shufflebuffersize}, *) as (rowid, label, features)
FROM training;
22
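The amplify-and-shuffle idea can be sketched roughly as follows. This hypothetical Python generator duplicates each row xtimes and shuffles within a bounded buffer; the real rand_amplify UDTF runs inside Hive map tasks, so treat this only as an illustration of the technique.

```python
import random

def rand_amplify(rows, xtimes, buffer_size, seed=0):
    """Emit each input row xtimes, in randomized order within a bounded
    shuffle buffer. Duplicating rows approximates multiple training
    epochs; the bounded buffer keeps memory use constant and avoids an
    extra shuffle (reduce) stage."""
    rng = random.Random(seed)
    buf = []
    for row in rows:
        for _ in range(xtimes):
            buf.append(row)
            if len(buf) >= buffer_size:
                rng.shuffle(buf)   # local shuffle, pipeline-friendly
                yield from buf
                buf.clear()
    rng.shuffle(buf)               # flush the remainder
    yield from buf

amplified = list(rand_amplify([1, 2, 3], xtimes=3, buffer_size=4))
```

Note the trade-off shown by the table later in the deck: a map-local shuffle mixes rows less thoroughly than a global CLUSTER BY rand(), but costs far less.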
Detailed plan w/ map-local shuffle

Map task:    Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write
             (the same pipeline runs in every map task)
                ↓ shuffle (distributed by feature)
Reduce task: Merge → Aggregate → Reduce write
             (the same pipeline runs in every reduce task)

Scanned entries are amplified and then shuffled. Note this is a pipelined operation:
the Rand Amplifier operator is interleaved between the table scan and the training operator.
23
Performance effects of amplifiers

Method                                             Elapsed time (sec)   AUC
Plain                                              89.718               0.734805
amplifier + clustered by (a.k.a. global shuffle)   479.855              0.746214
rand_amplifier (a.k.a. map-local shuffle)          116.424              0.743392

With the map-local shuffle, prediction accuracy improved with an acceptable overhead.
24
How to use Hivemall
Machine Learning:
  Training:   (Label, Feature Vector) → Prediction Model
  Prediction: (Feature Vector, Prediction Model) → Label

Data preparation
26
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
27
How to use Hivemall
Machine Learning:
  Training:   (Label, Feature Vector) → Prediction Model
  Prediction: (Feature Vector, Prediction Model) → Label

Feature Engineering
28
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(target, ${min_label}, ${max_label}) as label,
  features
from e2006tfidf_train;
Applying a Min-Max Feature Normalization
How to use Hivemall - Feature Engineering
Transforming a label value to a value between 0.0 and 1.0
29
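The min-max normalization that rescale() applies above is a one-liner. A Python sketch of the formula (illustrative, not Hivemall's code):

```python
def rescale(value: float, min_label: float, max_label: float) -> float:
    """Min-max normalization: map value from [min_label, max_label]
    into [0.0, 1.0], as done for the label column in the view above."""
    return (value - min_label) / (max_label - min_label)
```

For example, with min_label = 0 and max_label = 10, a target of 5 rescales to 0.5, and the range endpoints map to 0.0 and 1.0.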
How to use Hivemall
Machine Learning:
  Training:   (Label, Feature Vector) → Prediction Model
  Prediction: (Feature Vector, Prediction Model) → Label

Training
30
How to use Hivemall - Training

Training by logistic regression:

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight        -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train                   -- map-only task to learn a prediction model
) t
GROUP BY feature;              -- map outputs are shuffled to reducers by feature
31
How to use Hivemall - Training

Training of a Confidence Weighted classifier:

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight  -- vote to use negative or positive weights for the avg
FROM (
  SELECT train_cw(features, label) as (feature, weight)  -- training for the CW classifier
  FROM news20b_train
) t
GROUP BY feature;

Example weights for one feature: +0.7, +0.3, +0.2, -0.1, +0.7
32
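A rough Python sketch of the voted-average idea: vote on the sign of the weights and average only the winning side. This is my interpretation for illustration; it is not guaranteed to match the exact semantics of Hivemall's voted_avg aggregate.

```python
def voted_avg(weights):
    """Sign-vote aggregation: partition weights by sign, let the more
    numerous side win the vote, and average only the winners. Ties go
    to the positive side in this sketch (an assumption)."""
    pos = [w for w in weights if w > 0]
    neg = [w for w in weights if w < 0]
    winner = pos if len(pos) >= len(neg) else neg
    if not winner:
        return 0.0
    return sum(winner) / len(winner)

# The slide's example: positives win the vote 4 to 1, so the lone
# -0.1 outlier is excluded from the average
w = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

Compared with a plain average, this keeps a few disagreeing partial models from dragging a clearly positive (or negative) weight toward zero.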
How to use Hivemall
Machine Learning:
  Training:   (Label, Feature Vector) → Prediction Model
  Prediction: (Feature Vector, Prediction Model) → Label

Prediction
33
How to use Hivemall - Prediction

CREATE TABLE lr_predict AS
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid;

Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model.
There is no need to load the entire model into memory.
34
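The join-based prediction above amounts to summing the weights of the features that match the model and squashing the sum with a sigmoid. A minimal Python sketch (illustrative; the feature names and dict-based model are made up, whereas Hivemall keeps the model as a Hive relation):

```python
import math

def predict(test_features, model):
    """Sum the weights of matching features (what the LEFT OUTER JOIN +
    GROUP BY rowid computes per test row) and apply a sigmoid. Features
    absent from the model contribute 0, like unmatched join rows."""
    score = sum(model.get(f, 0.0) for f in test_features)
    return 1.0 / (1.0 + math.exp(-score))

model = {"height:tall": 1.2, "age:30s": -0.4}
prob = predict(["height:tall", "age:30s", "unseen"], model)
```

Because each test row only touches the weights of its own features, the model never needs to fit in one process's memory, which is the point the slide makes.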
How to use Hivemall
Machine Learning:
  Batch Training on Hadoop:    (Label, Feature Vector) → Prediction Model
  Online Prediction on RDBMS:  (Feature Vector, Prediction Model) → Label

Export prediction models
35
Real-time Prediction on Treasure Data
• Run the batch training job periodically
• Periodical export of the prediction model
• Real-time prediction on an RDBMS
36
Conclusion
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
37
Ø For SQL users that need ML
Ø For those already using Hive
Ø Designed with ease-of-use and scalability in mind
Does not require coding, packaging, compiling, or introducing a new programming language or APIs.
Hivemall’s Positioning
Thank you!
Makoto YUI - Research Engineer / Treasure Data
twitter: @myui
38
Download Hivemall from bit.ly/hivemall