Apache Hivemall @ Apache BigData '17, Miami
TRANSCRIPT
![Page 1: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/1.jpg)
Apache Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
1). Research Engineer, Treasure Data: Makoto YUI @myui
2). Research Engineer, NTT: Takeshi Yamamuro @maropu
@ApacheHivemall
2017/5/16 Apache BigData North America '17, Miami
![Page 2: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/2.jpg)
Plan of the talk
1. Introduction to Hivemall
2. Hivemall on Spark
![Page 3: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/3.jpg)
Hivemall entered the Apache Incubator on Sept 13, 2016 🎉
Since then, we have invited 2 contributors as new committers (one committer has been voted in as a PPMC member). Currently, we are working toward the first Apache release in Q2 of this year.
hivemall.incubator.apache.org
![Page 4: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/4.jpg)
What is Apache Hivemall
Scalable machine learning library built as a collection of Hive UDFs
Multi/Cross platform, Versatile, Scalable, Ease-of-use
![Page 5: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/5.jpg)
Hivemall is easy and scalable …
Ease-of-use: ML made easy for SQL developers
Scalable: Born to be parallel and scalable
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query automatically runs in parallel on Hadoop
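The map-side training and reduce-side averaging in the query above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: `train_partition` is a hypothetical stand-in for `logress` (a plain SGD logistic regression here), and the list of partitions mimics map tasks. It is not Hivemall's implementation.

```python
import math
from collections import defaultdict


def train_partition(rows, eta=0.1, epochs=5):
    """Hypothetical stand-in for logress: SGD logistic regression on one partition.

    rows: list of (features, label); features is a dict {feature: value},
    label is 0 or 1. Returns (feature, weight) pairs, like the UDTF output.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(weights[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                weights[f] += eta * (label - p) * v
    return list(weights.items())


def average_models(partition_outputs):
    """Reduce side: GROUP BY feature, then avg(weight)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for pairs in partition_outputs:
        for feature, weight in pairs:
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}


# Two "map tasks", each training on its own partition (made-up data).
partitions = [
    [({"a": 1.0}, 1), ({"b": 1.0}, 0)],
    [({"a": 1.0}, 1), ({"c": 1.0}, 0)],
]
model = average_models([train_partition(p) for p in partitions])
```

Each partition is trained independently, which is why the inner query can run as a map-only task; only the small (feature, weight) pairs are shuffled to the reducers.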
![Page 6: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/6.jpg)
Multi/Cross platform
Hivemall is Multi/Cross platform ..
Hivemall is a multi/cross-platform ML library
HiveQL, SparkSQL/DataFrame API, Pig Latin
Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive
![Page 7: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/7.jpg)
Hivemall's Technology Stack
Machine Learning: Hivemall, MLlib
Query Processing: Hive, Pig, SparkSQL
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
Resource Management: Apache YARN, MESOS
Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3
![Page 8: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/8.jpg)
Hivemall on Apache Hive
![Page 9: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/9.jpg)
Hivemall on Apache Spark DataFrame
![Page 10: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/10.jpg)
Hivemall on SparkSQL
![Page 11: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/11.jpg)
Hivemall on Apache Pig
![Page 12: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/12.jpg)
Online Prediction by Apache Streaming
![Page 13: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/13.jpg)
Versatile
Hivemall is a Versatile library ..
✓ Not only for Machine Learning
✓ Provides a bunch of generic utility functions
Each organization has its own set of UDFs for data preprocessing
Don't Repeat Yourself!
![Page 14: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/14.jpg)
Hivemall generic functions
Array and Map
Bit and compress
String and NLP
Geo Spatial
Top-k processing
> TF/IDF
> TILE
> MAP_URL
We welcome contributions of your generic UDFs to Hivemall!
![Page 15: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/15.jpg)
Map tiling functions
![Page 16: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/16.jpg)
Map tiling functions
Tile(lat, lon, zoom) = xtile(lon, zoom) + ytile(lat, zoom) * 2^n, where n is the zoom level
Zoom=10
Zoom=15
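A minimal Python sketch of the tile formula above. The slide does not define `xtile` and `ytile`, so this sketch assumes the common Web Mercator "slippy map" tile convention for them; treat those two functions as an assumption rather than as Hivemall's exact code.

```python
import math


def xtile(lon, zoom):
    """Column index of the tile containing longitude `lon` (assumed convention)."""
    n = 1 << zoom  # 2^zoom tiles per axis
    x = int(math.floor((lon + 180.0) / 360.0 * n))
    return min(max(x, 0), n - 1)  # clamp so lon = 180 stays in range


def ytile(lat, zoom):
    """Row index of the tile containing latitude `lat` (assumed convention)."""
    n = 1 << zoom
    lat_rad = math.radians(lat)
    y = int(math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n))
    return min(max(y, 0), n - 1)


def tile(lat, lon, zoom):
    """Single tile number: xtile + ytile * 2^zoom, as on the slide."""
    return xtile(lon, zoom) + ytile(lat, zoom) * (1 << zoom)
```

Because the tile number is a single integer per zoom level, it can be used directly as a GROUP BY key or as a categorical feature; a higher zoom gives finer cells.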
![Page 17: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/17.jpg)
Top-k query processing
List top-2 students for each class

Input:
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60

Output:
student  class  score
3        a      90
2        a      80
1        b      70
6        b      60
![Page 18: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/18.jpg)
Top-k query processing
List top-2 students for each class
SELECT * FROM (
  SELECT *, rank() over (partition by class order by score desc) as rank
  FROM table
) t
WHERE rank <= 2
The RANK over() query does not finish in 24 hours ☹ where there are 20 million MOOC classes and an average of 1,000 students in each class
![Page 19: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/19.jpg)
Top-k query processing
List top-2 students for each class
SELECT
  each_top_k(
    2, class, score,
    class, student
  ) as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t
EACH_TOP_K finishes in 2 hours ☺
![Page 20: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/20.jpg)
Top-k query processing by RANK OVER()
partition by class → Node 1 → Sort by class, score → rank over() → rank <= 2
![Page 21: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/21.jpg)
Top-k query processing by EACH_TOP_K
distributed by class → Node 1 → Sort by class → each_top_k → OUTPUT only K items
![Page 22: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/22.jpg)
Comparison between RANK and EACH_TOP_K
EACH_TOP_K: distributed by class → Sort by class → each_top_k → OUTPUT only K items
RANK: Sort by class, score (SORTING IS HEAVY) → rank over() (NEED TO PROCESS ALL) → rank <= 2
each_top_k is very efficient where the number of classes is large
Bounded Priority Queue is utilized
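The bounded priority queue idea can be sketched in a few lines of Python. This is a conceptual illustration, not Hivemall's code: the function name mirrors `each_top_k`, but the signature (callables for key and score) is an assumption made for the sketch. Rows only need to be clustered by class, not sorted by score, and each group keeps at most K items in memory.

```python
import heapq
from itertools import groupby


def each_top_k(k, rows, key, score):
    """Top-k rows per group; `rows` must already be clustered by `key`
    (the role of DISTRIBUTE BY class SORT BY class)."""
    out = []
    for group, members in groupby(rows, key=key):
        heap = []  # min-heap holding at most k items: the bounded priority queue
        for i, row in enumerate(members):
            item = (score(row), -i, row)  # -i breaks ties without comparing rows
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)  # evict the current smallest
        ranked = sorted(heap, reverse=True)
        out.extend((rank, s, group, row)
                   for rank, (s, _, row) in enumerate(ranked, 1))
    return out


# (student, class, score) rows from the slide, clustered by class.
students = [(1, "b", 70), (2, "a", 80), (3, "a", 90),
            (4, "b", 50), (5, "a", 70), (6, "b", 60)]
clustered = sorted(students, key=lambda r: r[1])
top2 = each_top_k(2, clustered, key=lambda r: r[1], score=lambda r: r[2])
```

With K fixed, each row costs O(log K) instead of taking part in a full per-class sort, which is where the gap to `rank() over()` comes from.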
![Page 23: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/23.jpg)
List of Supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does not work
Logistic regression is good for getting the probability of the positive class
Factorization Machines are good where features are sparse and categorical
![Page 24: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/24.jpg)
RandomForest in Hivemall
Ensemble of Decision Trees
![Page 25: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/25.jpg)
Training of RandomForest
![Page 26: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/26.jpg)
Prediction of RandomForest
![Page 27: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/27.jpg)
Supported Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items
![Page 28: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/28.jpg)
Other Supported Algorithms
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ Feature Binning
✓ TF-IDF vectorizer
✓ Polynomial Expansion
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer
Evaluation metrics
✓ AUC, nDCG, logloss, precision recall@K, etc.
![Page 29: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/29.jpg)
Feature Engineering – Feature Hashing
![Page 30: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/30.jpg)
Feature Engineering – Feature Hashing
![Page 31: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/31.jpg)
Feature Engineering – Feature Binning
Maps quantitative variables to a fixed number of bins based on quantiles/distribution
Map Ages into 3 bins
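A small Python sketch of quantile-based binning, following the "map ages into 3 bins" example. The helper names and the sample ages are made up for illustration, and the sketch uses only the Python standard library; it does not reproduce Hivemall's binning functions.

```python
import bisect
import statistics


def build_cut_points(values, num_bins):
    """Quantile boundaries that split `values` into `num_bins` groups."""
    return statistics.quantiles(sorted(values), n=num_bins)


def to_bin(value, cut_points):
    """Index of the bin (0 .. num_bins - 1) that `value` falls into."""
    return bisect.bisect_right(cut_points, value)


ages = [18, 22, 25, 31, 38, 45, 52, 60, 71]  # hypothetical sample
cuts = build_cut_points(ages, 3)              # two boundaries for three bins
bins = [to_bin(a, cuts) for a in ages]
```

Because the boundaries come from quantiles rather than from fixed-width ranges, each bin receives roughly the same number of rows even when the distribution is skewed.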
![Page 32: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/32.jpg)
Feature Selection – Signal Noise Ratio
![Page 33: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/33.jpg)
Evaluation Metrics
![Page 34: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/34.jpg)
Other Supported Features
Anomaly Detection
✓ Local Outlier Factor (LoF)
✓ ChangeFinder
Clustering / Topic models
✓ Online mini-batch LDA
✓ Online mini-batch PLSA
Change Point Detection
✓ ChangeFinder
✓ Singular Spectrum Transformation
![Page 35: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/35.jpg)
Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers from time series data
J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
![Page 36: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/36.jpg)
Anomaly/Change-point Detection by ChangeFinder
Take this…
![Page 37: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/37.jpg)
Anomaly/Change-point Detection by ChangeFinder
… and do this!
![Page 38: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/38.jpg)
![Page 39: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/39.jpg)
![Page 40: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/40.jpg)
Change-point detection by Singular Spectrum Transformation
• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
![Page 41: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/41.jpg)
Online mini-batch LDA
![Page 42: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/42.jpg)
Other new features in development
✓ XGBoost Integration
✓ Sparse vector support in RandomForests
✓ Docker support (for evaluation)
✓ Field-aware Factorization Machines
✓ Generalized Linear Model
  • Optimizer framework including ADAM
  • L1/L2 regularization
![Page 43: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/43.jpg)
Copyright©2016 NTT corp. All Rights Reserved.
Hivemall on Spark
![Page 44: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/44.jpg)
Who am I
• Takeshi Yamamuro
• NTT corp. in Japan
• OSS activities
  • Apache Hivemall PPMC
  • Apache Spark contributor
![Page 45: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/45.jpg)
What's Spark?
• Distributed data analytics engine, generalizing MapReduce
![Page 46: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/46.jpg)
What's Spark?
• 1. Unified Engine
  • supports end-to-end applications, e.g., MLlib and Streaming
• 2. High-level APIs
  • easy-to-use, rich optimization
• 3. Integrate broadly
  • storages, libraries, ...
![Page 47: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/47.jpg)
What's Hivemall on Spark?
• Hivemall wrapper for Spark
  • Wrapper implementations for DataFrame/SQL
  • + some utilities for ease of use in Spark
• The wrapper lets you...
  • run most Hivemall functions in Spark
  • try examples easily on your laptop
  • improve the performance of some functions in Spark
![Page 48: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/48.jpg)
Why Hivemall on Spark?
• Hivemall already has many fascinating ML algorithms and useful utilities
• + High barriers to adding newer algorithms to MLlib
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
![Page 49: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/49.jpg)
Current Status
• Most Hivemall functions are supported in Spark v2.0 and v2.1
- To compile for Spark v2.1
$ git clone https://github.com/apache/incubator-hivemall
$ cd incubator-hivemall
$ mvn package -Pspark-2.1 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.1_2.11-XXX-with-dependencies.jar
...
![Page 50: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/50.jpg)
Current Status
• Most Hivemall functions are supported in Spark v2.0 and v2.1
- To compile for Spark v2.0
$ git clone https://github.com/apache/incubator-hivemall
$ cd incubator-hivemall
$ mvn package -Pspark-2.0 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
...
![Page 51: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/51.jpg)
4 Step Example
• 1. Fetch training and test data
• 2. Load these data in Spark
• 3. Build a model
• 4. Do predictions
![Page 52: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/52.jpg)
1. Fetch training and test data
• E2006 tfidf regression dataset
  • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
![Page 53: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/53.jpg)
2. Load data in Spark
// Download Spark-v2.1 and launch a spark-shell with Hivemall
$ <HIVEMALL_HOME>/bin/spark-shell
// Create DataFrame from the bzip'd libsvm-formatted file
scala> val trainDf = spark.read.format("libsvm").load("E2006.train.bz2")
scala> trainDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
![Page 54: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/54.jpg)
2. Load data in Spark (Detailed)
0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…
trainDf
Partition1 Partition2 Partition3 … PartitionN
Load in parallel because bzip2 is splittable
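For readers unfamiliar with the libsvm text format shown above, here is a minimal Python sketch of how one line maps to a label and a sparse feature vector. The function name and the sample line are made up for illustration; Spark's own `libsvm` data source does this parsing for you.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> <index>:<value> ...' into (label, features)."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)  # sparse: only listed indices are stored
    return label, features


# Hypothetical line; real E2006 rows carry many more index:value pairs.
label, features = parse_libsvm_line("-3.5 6066:0.25 6069:1.0")
```

Each line is independent of the others, so once the compressed file is split, every partition can be parsed without coordination.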
![Page 55: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/55.jpg)
3. Build a model - DataFrame
scala> :paste
val modelDf = trainDf.train_logregr($"features", $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")
![Page 56: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/56.jpg)
3. Build a model - SQL
scala> trainDf.createOrReplaceTempView("TrainTable")
scala> :paste
val modelDf = sql("""
  | SELECT feature, AVG(weight) AS weight
  | FROM (
  |   SELECT train_logregr(features, label)
  |     AS (feature, weight)
  |   FROM TrainTable
  | )
  | GROUP BY feature
""".stripMargin)
![Page 57: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/57.jpg)
4. Do predictions - DataFrame
scala> :paste
val df = testDf.select(rowid(), $"features").explode_vector($"features").cache
// Do predictions
df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER")
  .groupBy("rowid")
  .agg(sigmoid(sum($"weight" * $"value")))
![Page 58: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/58.jpg)
4. Do predictions - SQL
scala> modelDf.createOrReplaceTempView("ModelTable")
scala> df.createOrReplaceTempView("TestTable")
scala> :paste
sql("""
  | SELECT rowid, sigmoid(sum(value * weight)) AS predicted
  | FROM TestTable t
  | LEFT OUTER JOIN ModelTable m
  | ON t.feature = m.feature
  | GROUP BY rowid
""".stripMargin)
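The prediction step is a join of exploded test features against the model, a per-row sum of weight times value, and a sigmoid. A plain Python sketch of that logic follows; the function names and data layout are assumptions for illustration and do not correspond to Hivemall or Spark APIs.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(test_rows, model):
    """test_rows: (rowid, feature, value) triples, i.e. exploded feature vectors.
    model: dict mapping feature -> averaged weight.

    Features missing from the model add nothing to the sum, like NULL weights
    after a LEFT OUTER JOIN. Unlike SQL, a row with no matching feature at all
    yields sigmoid(0) = 0.5 here rather than NULL.
    """
    totals = defaultdict(float)
    for rowid, feature, value in test_rows:
        totals[rowid] += model.get(feature, 0.0) * value
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


# Made-up model and test rows.
model = {"a": 1.5, "b": -0.5}
rows = [(1, "a", 2.0), (1, "b", 1.0), (2, "a", 0.0), (2, "unseen", 3.0)]
predicted = predict(rows, model)
```

Because the model is just a (feature, weight) table, the same join works whether the model was trained from Hive or from Spark.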
![Page 59: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/59.jpg)
Add New Features in Spark
• Top-K Join
  • Join two relations and compute top-K for each group
Use Spark vanilla APIs
scala> :paste
val topkDf = leftDf.join(rightDf, "group" :: Nil, "INNER")
  .select(leftDf("group"), (leftDf("x") + rightDf("y")).as("score"))
  .withColumn("rank", rank().over(Window.partitionBy($"group").orderBy($"score".desc)))
  .where($"rank" <= topK)
![Page 60: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/60.jpg)
Add New Features in Spark
• Top-K Join
  • Join two relations and compute top-K for each group
Use Spark vanilla APIs
scala> :paste
val topkDf = leftDf.join(rightDf, "group" :: Nil, "INNER") // 1. Join leftDf with rightDf
  .select(leftDf("group"), (leftDf("x") + rightDf("y")).as("score"))
  .withColumn("rank", rank().over(Window.partitionBy($"group").orderBy($"score".desc))) // 2. Sort data in each group
  .where($"rank" <= topK) // 3. Filter top-K data
![Page 61: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/61.jpg)
Add New Features in Spark
• Top-K Join
  • Join two relations and compute top-K for each group
Use a fused API implemented in Hivemall
scala> :paste
val topkDf = leftDf.top_k_join(
  lit(topK), rightDf, leftDf("group") === rightDf("group"),
  (leftDf("x") + rightDf("y")).as("score")
)
![Page 62: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/62.jpg)
Add New Features in Spark
• How does top_k_join work?
[Figure: leftDf (columns group, x) and rightDf (columns group, y)]
![Page 63: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/63.jpg)
Add New Features in Spark
• How does top_k_join work?
[Figure: leftDf (columns group, x), rightDf (columns group, y), and a K-length priority queue]
Compute top-K rows by using a priority queue
![Page 64: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/64.jpg)
Add New Features in Spark
• How does top_k_join work?
[Figure: leftDf (columns group, x), rightDf (columns group, y), and a K-length priority queue]
Compute top-K rows by using a priority queue
Only join top-K rows
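The fused top-K join can be pictured as a hash join that scores candidate pairs and keeps only K of them per group in a priority queue, so the full joined result is never sorted. The Python sketch below is a conceptual model only; the function signature and the fixed score `x + y` are assumptions for illustration, not the Hivemall operator.

```python
import heapq
from collections import defaultdict


def top_k_join(k, left, right):
    """left: (group, x) rows; right: (group, y) rows; score = x + y.
    Returns (group, rank, score, x, y) for the top-k pairs of each group."""
    right_by_group = defaultdict(list)  # build side of the hash join
    for group, y in right:
        right_by_group[group].append(y)

    heaps = defaultdict(list)  # one K-length min-heap per group
    for group, x in left:
        for y in right_by_group.get(group, ()):
            item = (x + y, x, y)
            heap = heaps[group]
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)  # drop the weakest candidate

    result = []
    for group in sorted(heaps):
        ranked = sorted(heaps[group], reverse=True)
        for rank, (score, x, y) in enumerate(ranked, 1):
            result.append((group, rank, score, x, y))
    return result


# Made-up input relations.
left = [("g1", 1), ("g1", 5), ("g2", 2)]
right = [("g1", 10), ("g1", 20), ("g2", 3), ("g3", 7)]
joined = top_k_join(2, left, right)
```

Compared with join, then window sort, then filter, the queue bounds the per-group state to K rows, which is the main source of the saving.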
![Page 65: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/65.jpg)
Add New Features in Spark
• Codegen'd top_k_join for fast processing
  • Spark internally generates Java code from a built physical plan, and compiles/executes it
Spark Planner (Catalyst) Overview
![Page 66: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/66.jpg)
Add New Features in Spark
• Codegen'd top_k_join for fast processing
  • Spark internally generates Java code from a built physical plan, and compiles/executes it
A Physical Plan → Codegen
![Page 67: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/67.jpg)
Add New Features in Spark
• Codegen'd top_k_join for fast processing
  • Spark internally generates Java code from a built physical plan, and compiles/executes it
scala> topkDf.explain
== Physical Plan ==
*ShuffledHashJoinTopK 1, [group#10], [group#27]
:- Exchange hashpartitioning(group#10, 200)
:  +- LocalTableScan [group#10, x#11]
+- Exchange hashpartitioning(group#27, 200)
   +- LocalTableScan [group#27, y#28]
'*' at the head means a codegen'd plan
![Page 68: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/68.jpg)
Add New Features in Spark
• Benchmark Result
~13 times faster than vanilla APIs!!
![Page 69: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/69.jpg)
Add New Features in Spark
• Support more useful functions
  • Spark only implements naive and basic functions in terms of usability and maintainability
  • e.g.,
    • flatten - flatten a nested schema into a flat one
    • from_csv/to_csv – interconversion of CSV strings and structured data with schemas
    • ...
• See more in the Hivemall user guide
  • https://hivemall.incubator.apache.org/userguide/spark/misc/misc.html
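To make "flatten a nested schema into a flat one" concrete, here is a minimal Python sketch that flattens a nested record. It works on a plain dict rather than a DataFrame schema, and the `.` separator is chosen for illustration only, so it should be read as an analogy to the `flatten` utility rather than a description of its behavior.

```python
def flatten(record, prefix="", sep="."):
    """Flatten nested dicts into a single-level dict with joined key paths."""
    flat = {}
    for name, value in record.items():
        path = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, path, sep))  # recurse into nested struct
        else:
            flat[path] = value
    return flat


# Made-up nested record.
nested = {"a": 1, "b": {"c": 2, "d": {"e": 3}}}
flat = flatten(nested)
```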
![Page 70: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/70.jpg)
Improve Some Functions in Spark
• Spark has overheads to call Hive UD*Fs
  • Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
scala> val sparkUdf = functions.udf(sigmoidFunc)
scala> df.select(sparkUdf($"value"))
![Page 71: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/71.jpg)
Improve Some Functions in Spark
• Spark has overheads to call Hive UD*Fs
  • Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
scala> val hiveUdf = HivemallOps.sigmoid
scala> df.select(hiveUdf($"value"))
![Page 72: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/72.jpg)
Improve Some Functions in Spark
• Spark has overheads to call Hive UD*Fs
  • Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
![Page 73: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/73.jpg)
Improve Some Functions in Spark
• Spark has overheads to call Hive UD*Fs
  • Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
scala> :paste
df.withColumn("rank", rank().over(Window.partitionBy($"key").orderBy($"score".desc)))
  .where($"rank" <= topK)
![Page 74: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/74.jpg)
Improve Some Functions in Spark
• Spark has overheads to call Hive UD*Fs
  • Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
  • Fixed the overhead issue for each_top_k
  • See pr#353: "Implement EachTopK as a generator expression" in Spark
scala> df.each_top_k(topK, "key", "score", "value")
![Page 75: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/75.jpg)
Improve Some Functions in Spark
• Spark has overheads to call Hive UD*Fs
  • Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
~4 times faster than rank!!
![Page 76: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/76.jpg)
Support 3rd Party Library in Spark
• XGBoost support is under development
  • fast implementation of gradient tree boosting
  • widely used in Kaggle competitions
• This integration will let you...
  • load built models and predict in parallel
  • build multiple models in parallel for cross-validation
![Page 77: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/77.jpg)
Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs
HiveQL, SparkSQL/DataFrame API, Pig Latin
We welcome your contributions to Apache Hivemall ☺
![Page 78: Apache Hivemall @ Apache BigData '17, Miami](https://reader034.vdocuments.us/reader034/viewer/2022050613/5a6550ae7f8b9a587a8b49eb/html5/thumbnails/78.jpg)
Any feature requests or questions?