Introduction to Hivemall
![Page 1: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/1.jpg)
Hivemall: Scalable Machine Learning Library for Apache Hive
Research Engineer Makoto YUI @myui
bit.ly/hivemall
![Page 2: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/2.jpg)
![Page 3: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/3.jpg)
![Page 4: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/4.jpg)
Treasure Data Cloud Service
[Architecture diagram. Data sources: app logs, sensors, Apache logs, CRM, ERP, RDBMS. Treasure Data Collectors (PUSH): Treasure Agent, Embulk, Embedded, Mobile SDK, JS SDK. Processing: Hive (batch), Presto (ad hoc), Machine Learning. Access: API, ODBC/JDBC. External integrations: SQL Server, BI tools, data analysis.]
900,000 records stored per sec.
![Page 5: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/5.jpg)
Agenda
1. What is Hivemall (short intro.)
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
![Page 6: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/6.jpg)
What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
[Stack diagram, top layer first]
Machine Learning: Hivemall, MLlib
Query Processing: Hive, Pig, SparkSQL
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
Resource Management: Apache YARN, MESOS
Distributed File System: Hadoop HDFS
![Page 7: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/7.jpg)
Won IDG's InfoWorld 2014
Bossie Awards 2014: The best open source big data tools
InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem (awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka)
bit.ly/hivemall-award
![Page 8: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/8.jpg)
List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad + RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ Factorization Machines
✓ RandomForest Regression
![Page 9: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/9.jpg)
SCW is a good first choice
Try RandomForest if SCW does not work
Logistic regression is good for getting the probability of the positive class
Factorization Machines are good where features are sparse and categorical
![Page 10: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/10.jpg)
List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items
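The idea behind recommending top-k items can be sketched in plain Python: group scored rows by a key (for example, a user) and keep the k best-scoring items per group. This is only an illustration of the concept, not Hivemall's each_top_k implementation or its argument order; the function name `top_k_per_group` and the sample rows are hypothetical.

```python
import heapq
from collections import defaultdict

def top_k_per_group(rows, k):
    """Return the k highest-scoring (item, score) pairs for each group.

    rows: iterable of (group, item, score) tuples,
          e.g. (user, item, predicted rating).
    """
    by_group = defaultdict(list)
    for group, item, score in rows:
        by_group[group].append((item, score))
    # nlargest returns the pairs sorted by score, highest first
    return {
        group: heapq.nlargest(k, items, key=lambda pair: pair[1])
        for group, items in by_group.items()
    }

# Made-up scores for two users
scores = [
    ("alice", "item1", 0.9),
    ("alice", "item2", 0.4),
    ("alice", "item3", 0.7),
    ("bob", "item1", 0.2),
    ("bob", "item3", 0.8),
]
recommendations = top_k_per_group(scores, k=2)
print(recommendations)
```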
![Page 11: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/11.jpg)
Other Supported Algorithms
Anomaly Detection
✓ Local Outlier Factor (LoF)
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
![Page 12: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/12.jpg)
Industry use cases of Hivemall
● CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc. and more
● Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
● Churn Detection
  • Algorithm: Regression
  • OISIX and more
● Item/User recommendation
  • Algorithm: Recommendation (Matrix Factorization / kNN)
  • Adtech Companies, ISP portal, and more
● Value prediction of Real estates
  • Algorithm: Regression
  • Livesense
![Page 13: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/13.jpg)
![Page 14: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/14.jpg)
Why Hivemall
1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.
2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.
That's why I built Hivemall.
![Page 15: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/15.jpg)
How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
[Pipeline diagram: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector (file) → Machine Learning]
Example feature vector: height: 173cm, weight: 60kg, age: 34, gender: man, …
![Page 16: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/16.jpg)
Extract-Transform-Load: need to do expensive data preprocessing
(joins, filtering, and formatting of data that does not fit in memory)
![Page 17: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/17.jpg)
Machine Learning: does not scale
Have to learn R/Python APIs
![Page 18: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/18.jpg)
Does not meet my needs in terms of scalability, ML algorithms, and usability
I ❤ scalable SQL query
![Page 19: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/19.jpg)
Survey on existing ML frameworks

| Framework | User interface |
|---|---|
| Mahout | Java API programming |
| Spark MLlib/MLI | Scala API programming, Scala Shell (REPL) |
| H2O | R programming, GUI |
| Cloudera Oryx | HTTP REST API programming |
| Vowpal Wabbit (w/ Hadoop streaming) | C++ API programming, Command Line |

Existing distributed machine learning frameworks are NOT easy to use
![Page 20: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/20.jpg)
Hivemall's Vision: ML on SQL
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
![Page 21: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/21.jpg)
Hivemall on Apache Spark
Installation is very easy as follows:
$ spark-shell --packages maropu:hivemall-spark:0.0.6
![Page 22: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/22.jpg)
![Page 23: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/23.jpg)
How Hivemall works in training
Implemented machine learning algorithms as User-Defined Table generating Functions (UDTFs)
A UDTF is a function that returns a relation
[Dataflow diagram: Training table of tuple<label, array<features>>, e.g. (+1, <1,2>), (+1, <1,7,9>), (-1, <1,3,9>), (+1, <3,8>) → train UDTFs run in parallel (with param-mix) and emit tuple<feature, weights> → Shuffle by feature → Prediction model as Relation<feature, weights>]
● The resulting prediction model is a relation of features and their weights
● The numbers of mappers and reducers are configurable
Parallelism is Powerful
![Page 24: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/24.jpg)
Alternative Approach in Hivemall
Hivemall provides the amplify UDTF to enumerate iteration effects in machine learning without several MapReduce steps
SET hivevar:xtimes=3;
CREATE VIEW training_x3
as
SELECT *
FROM (
  SELECT amplify(${xtimes}, *) as (rowid, label, features)
  FROM training
) t
CLUSTER BY rand()
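What the view above does can be pictured with a small Python sketch: every training row is emitted `xtimes` times and the copies are then shuffled, so a single pass over the amplified data behaves roughly like several passes over the original. The generator below is a hypothetical stand-in written for illustration, not Hivemall's code, and the sample rows are made up.

```python
import random

def amplify(xtimes, rows):
    """Yield every row xtimes times, mimicking extra training passes."""
    for row in rows:
        for _ in range(xtimes):
            yield row

# Made-up (rowid, label, features) rows
training = [(1, +1, ["1", "2"]), (2, -1, ["1", "3", "9"])]

training_x3 = list(amplify(3, training))
# Shuffling plays the role of CLUSTER BY rand() in the view
rng = random.Random(42)  # fixed seed so the shuffle is reproducible
rng.shuffle(training_x3)
print(training_x3)
```

The shuffle matters for online learners: without it, the copies of a row would be seen back to back instead of being spread across the pass.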
![Page 25: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/25.jpg)
![Page 26: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/26.jpg)
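How to use Hivemall
[Workflow diagram: Training takes Feature Vectors and Labels and produces a Prediction Model via Machine Learning; Prediction applies the Prediction Model to a Feature Vector to produce a Label. Highlighted step: Data preparation]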
![Page 27: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/27.jpg)
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
Create external table e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
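To make the row format concrete, here is a small Python sketch of how one line of such a text file maps onto the three columns: fields are split on tabs and the features array is split on commas. The sample line and the `index:value` shape of each feature string are assumptions made for illustration; the slides do not show the file contents.

```python
def parse_row(line):
    """Split one text line into (rowid, label, features)."""
    rowid, label, features = line.rstrip("\n").split("\t")
    # COLLECTION ITEMS TERMINATED BY "," -> array<string>
    return int(rowid), float(label), features.split(",")

# Made-up row: rowid, label, then comma-separated feature strings
sample = "1\t-3.58\t1:0.5,7:0.25,9:1.0\n"
row = parse_row(sample)
print(row)
```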
![Page 28: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/28.jpg)
How to use Hivemall
[Workflow diagram repeated. Highlighted step: Feature Engineering]
![Page 29: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/29.jpg)
How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
Transforming a label value to a value between 0.0 and 1.0
create view e2006tfidf_train_scaled as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from e2006tfidf_train;
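Min-max normalization maps a value x to (x - min) / (max - min), so the smallest label becomes 0.0 and the largest becomes 1.0. The sketch below shows that arithmetic in Python. It assumes the minimum and maximum are computed beforehand (as the `${min_label}` and `${max_label}` variables suggest), and the handling of the degenerate case where both are equal is an assumption of this sketch. The label values are made up.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization of value into the range [0.0, 1.0]."""
    if max_value == min_value:
        return 0.5  # assumption: midpoint when the range is empty
    return (value - min_value) / (max_value - min_value)

labels = [-7.9, -3.0, -0.5]  # made-up label values
lo, hi = min(labels), max(labels)
scaled = [rescale(v, lo, hi) for v in labels]
print(scaled)
```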
![Page 30: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/30.jpg)
How to use Hivemall
[Workflow diagram repeated. Highlighted step: Training]
![Page 31: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/31.jpg)
How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight) -- map-only task to learn a prediction model
  FROM train
) t
GROUP BY feature -- shuffle map-outputs to reducers by feature
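The three stages in the query (learn per mapper, shuffle by feature, average per feature) can be sketched in Python. This is a deliberately simplified illustration under stated assumptions: a single SGD pass with a fixed learning rate `eta`, binary features with an implicit value of 1.0, and labels of 0 or 1. It is not Hivemall's logress implementation, and the partitions are made up.

```python
import math
from collections import defaultdict

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logress(rows, eta=0.1):
    """One SGD pass of logistic regression over (features, label) rows.

    Returns {feature: weight}, the analogue of one mapper's output.
    """
    weights = defaultdict(float)
    for features, label in rows:
        predicted = sigmoid(sum(weights[f] for f in features))
        gradient = label - predicted
        for f in features:
            weights[f] += eta * gradient
    return dict(weights)

def average_models(models):
    """GROUP BY feature + avg(weight) over the per-mapper models."""
    grouped = defaultdict(list)
    for model in models:
        for feature, weight in model.items():
            grouped[feature].append(weight)
    return {f: sum(ws) / len(ws) for f, ws in grouped.items()}

# Two made-up data partitions, one per "mapper"
partition_a = [(["f1", "f2"], 1), (["f1", "f3"], 0)]
partition_b = [(["f2"], 1), (["f3"], 0)]

models = [train_logress(p) for p in (partition_a, partition_b)]
lr_model = average_models(models)
print(lr_model)
```

Because each partition is trained independently, the only coordination needed is the final per-feature average, which is what lets the query run as an ordinary map and reduce job.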
![Page 32: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/32.jpg)
How to use Hivemall - Training
Training of the Confidence Weighted (CW) classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight -- vote to use negative or positive weights for avg
FROM (
  SELECT train_cw(features, label) as (feature, weight) -- training for the CW classifier
  FROM news20b_train
) t
GROUP BY feature
Example weights for one feature: +0.7, +0.3, +0.2, -0.1, +0.7
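The voting step can be sketched in Python as follows: count how many of a feature's weights are positive and how many are negative, then average only the weights on the winning side. The sketch follows the slide's description; how ties and zero weights are treated here (ties go to the negative side, zeros are ignored) is an assumption, not a statement about Hivemall's voted_avg.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0  # assumption: nothing to vote on
    # Assumption: a tie falls to the negative side
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)

# The example weights from the slide
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
print(result)
```

For the slide's example, four of the five weights are positive, so the lone -0.1 is dropped and the result is the mean of the four positive weights, about 0.475.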
![Page 33: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/33.jpg)
How to use Hivemall
[Workflow diagram repeated. Highlighted step: Prediction]
![Page 34: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/34.jpg)
How to use Hivemall - Prediction
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid
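The join-based prediction can be mimicked in Python: each (rowid, feature) pair of the exploded test data looks up its weight in the model, the weights are summed per row, and the sum is passed through the sigmoid. The model weights and test rows below are made up. One difference is flagged in the comments: in SQL a row whose features are all missing from the model gets a NULL sum, while this sketch treats missing weights as 0.0.

```python
import math
from collections import defaultdict

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(testing_exploded, lr_model):
    """testing_exploded: (rowid, feature) pairs; lr_model: {feature: weight}."""
    totals = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: a feature absent from the model adds nothing.
        # Sketch-only simplification: missing weights count as 0.0, not NULL.
        totals[rowid] += lr_model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}

lr_model = {"f1": 1.2, "f2": -0.4}  # made-up weights
testing_exploded = [
    (1, "f1"), (1, "f2"),
    (2, "f2"), (2, "unseen"),
    (3, "unseen"),
]
probs = predict(testing_exploded, lr_model)
print(probs)
```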
![Page 35: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/35.jpg)
Real-time prediction
[Workflow diagram: Batch Training on Hadoop produces a Prediction Model from Feature Vectors and Labels; the prediction model is exported; Online Prediction on RDBMS applies it to a Feature Vector to produce a Label]
bit.ly/hivemall-rtp
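Because the model is just a relation of features and weights, it can be loaded into any relational database and queried per request. The sketch below uses Python's built-in sqlite3 module as a stand-in RDBMS, which is an assumption of this example (the slides do not name a database). The table contents are made up, and the sigmoid is applied in Python because SQLite has no built-in sigmoid function.

```python
import math
import sqlite3

# In-memory database standing in for the online RDBMS
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lr_model (feature TEXT PRIMARY KEY, weight REAL)")
# Made-up weights standing in for a model exported from Hadoop
conn.executemany(
    "INSERT INTO lr_model VALUES (?, ?)",
    [("f1", 1.2), ("f2", -0.4)],
)

def predict_online(conn, features):
    """Sum the weights of the given features and apply the sigmoid."""
    placeholders = ",".join("?" for _ in features)
    query = (
        "SELECT COALESCE(SUM(weight), 0.0) FROM lr_model "
        f"WHERE feature IN ({placeholders})"
    )
    (total,) = conn.execute(query, list(features)).fetchone()
    return 1.0 / (1.0 + math.exp(-total))

prob = predict_online(conn, ["f1", "f2", "unseen"])
print(prob)
```

The primary key on `feature` lets each request touch only the rows for its own features, so the whole model never has to be held in application memory.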
![Page 36: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/36.jpg)
Conclusion
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
Hivemall's Positioning
● For SQL users who need ML
● For those already using Hive
● Designed with ease of use and scalability in mind
Does not require coding, packaging, compiling, or introducing a new programming language or APIs.
![Page 37: Introduction to Hivemall](https://reader034.vdocuments.us/reader034/viewer/2022052418/58f9a950760da3da068b6d68/html5/thumbnails/37.jpg)
Thank you!
Makoto YUI - Research engineer / Treasure Data
twitter: @myui
Download Hivemall from bit.ly/hivemall