Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
DESCRIPTION
Josh Patterson's Hadoop Summit EU 2013 talk on parallel linear regression on IterativeReduce and YARN.
TRANSCRIPT
Linear Regression and Metronome
Modern Big Data Analytics
Josh Patterson
Email:
Twitter:
@jpatanooga
Github:
https://github.com/jpatanooga
Past
Published in IAAI-09:
“TinyTermite: A Secure Routing Algorithm”
Grad work in Meta-heuristics, Ant-algorithms
Tennessee Valley Authority (TVA)
Hadoop and the Smartgrid
Cloudera
Principal Solution Architect
Today
Independent Consultant
Sections
1. Modern Data Analytics
2. Parallel Linear Regression
3. Performance and Results
The Evolution of Data Tools
Modern Data Analytics
The World as Optimization
Data tells us about our model/engine/product
We take this data and evolve our product towards a state of minimal market error
WSJ Special Section, Monday March 11, 2013
Zynga changing games based off player behavior
UPS cut fuel consumption by 8.4MM gallons
Ford used sentiment analysis to look at how new car features would be received
The Modern Data Landscape
Apps are coming but they need
Platforms
Components
Workflows
Lots of investment in Hadoop in this space
Lots of ETL pipelines
Lots of descriptive Statistics
Growing interest in Machine Learning
Hadoop as The Linux of Data
Hadoop has won the Cycle
Gartner: Hadoop will be in 2/3s of advanced analytics products by 2015 [1]
“Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on this stage”
---Doug Cutting
Today’s Hadoop ML Pipeline
Data cleansing / ETL performed with Hive or Pig
Data Processed In Place:
Mahout
R
Custom MapReduce Algorithm
Or Processed Externally:
SAS
SPSS
KXEN
Weka
As Focus Shifts to Applications
Data rates have been climbing fast
Speed at Scale becomes the new Killer App
Companies will want to leverage the Big Data infrastructure they’ve already been working with
Hadoop
HDFS as main storage system
A drive to validate big data investments with results
Emergence of applications which create “data products”
Patterson’s Law
“As the percent of your total data held in a storage system approaches 100% the amount of in-system processing and analytics also approaches 100%”
Tools Will Move onto Hadoop
Already seeing this with Vendors
Who hasn’t announced a SQL engine on Hadoop lately?
Trend will continue with machine learning tools
Mahout was the beginning
More are following
But what about parallel iterative algorithms?
Distributed Systems Are Hard
Lots of moving parts
Especially as these applications become more complicated
Machine learning can be a non-trivial operation
We need great building blocks that work well together
I agree with Jimmy Lin [3]: “keep it simple”
“make sure costs don’t outweigh benefits”
Minimize “Yet Another Tool To Learn” (YATTL) as much as we can!
To Summarize
Data moving into Hadoop everywhere
Patterson’s Law
Focus on Hadoop, build around the next-gen “Linux of data”
Need simple components to build next-gen data apps
They should work cleanly with the cluster that the Fortune 500 has: Hadoop
Also should be easy to integrate into Hadoop and the Hadoop tool ecosystem
Minimize YATTL
Multivariate Linear Regression on Hadoop
Parallel Linear Regression
Linear Regression
In linear regression, data is modeled using linear predictor functions
unknown model parameters are estimated from the data.
We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model
Y = (1*x0) + (c1*x1) + … + (cN*xN)
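To make the model concrete, here is a minimal Java sketch (an illustration added for this write-up, not code from the talk or from Metronome) of the linear predictor: the output is just the dot product of the coefficient vector and the feature vector, with a constant 1.0 feature standing in for the intercept.

    // Minimal sketch of the linear predictor Y = c0*x0 + c1*x1 + ... + cN*xN.
    // Hypothetical helper class, not part of Metronome's actual API.
    public class LinearModel {
        private final double[] coefficients; // c0..cN, where c0 multiplies the constant term

        public LinearModel(double[] coefficients) {
            this.coefficients = coefficients;
        }

        // features[0] is expected to be 1.0, so coefficients[0] acts as the intercept
        public double predict(double[] features) {
            double y = 0.0;
            for (int i = 0; i < coefficients.length; i++) {
                y += coefficients[i] * features[i];
            }
            return y;
        }
    }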
Machine Learning and Optimization
Algorithms
(Convergent) Iterative Methods
Newton’s Method
Quasi-Newton
Gradient Descent
Heuristics
AntNet
PSO
Genetic Algorithms
Stochastic Gradient Descent
Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/11
Hypothesis about data
Cost function
Update function
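The slide text does not reproduce the equations themselves; for reference, the standard forms from the linked tutorial, specialized to linear regression, are (written here in LaTeX):

    h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n
    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
    \theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{(update applied per training example } i\text{)}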
Stochastic Gradient Descent
Training
Simple gradient descent procedure
Loss function needs to be convex (with exceptions)
Linear Regression
Loss Function: squared error of prediction
Prediction: linear combination of coefficients and input variables
[Diagram: Training Data → SGD → Model]
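A minimal Java sketch of the per-example update described above (illustrative only, not the Metronome implementation), using squared-error loss: compute the prediction for one example, take the error, and move each coefficient a small step against the gradient.

    // One stochastic gradient descent pass over a set of examples, for linear
    // regression with squared-error loss. Sketch only; assume it lives in the
    // same utility class as the LinearModel example above.
    public static void sgdPass(double[] coefficients, double[][] features,
                               double[] labels, double learningRate) {
        for (int i = 0; i < features.length; i++) {
            // prediction = dot(coefficients, features[i]); features[i][0] == 1.0 for the intercept
            double prediction = 0.0;
            for (int j = 0; j < coefficients.length; j++) {
                prediction += coefficients[j] * features[i][j];
            }
            double error = prediction - labels[i];
            // the gradient of 0.5 * error^2 w.r.t. coefficient j is error * features[i][j]
            for (int j = 0; j < coefficients.length; j++) {
                coefficients[j] -= learningRate * error * features[i][j];
            }
        }
    }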
Mahout’s SGD
Currently Single Process
Multi-threaded parallel, but not cluster parallel
Runs locally, not deployed to the cluster
Tied to logistic regression implementation
Current Limitations
Sequential algorithms on a single node only go so far
The “Data Deluge”
Presents algorithmic challenges when combined with large data sets
Need to design algorithms that are able to perform in a distributed fashion
MapReduce only fits certain types of algorithms
Distributed Learning Strategies
McDonald, 2010
Distributed Training Strategies for the Structured Perceptron
Langford, 2007
Vowpal Wabbit
Jeff Dean’s Work on Parallel SGD
DownPour SGD
Sandblaster
MapReduce vs. Parallel Iterative
[Diagram: the MapReduce flow runs Input through Map tasks and Reduce tasks to produce Output; the parallel iterative flow keeps a fixed set of processors that advance together through Superstep 1, Superstep 2, and so on]
YARN
Yet Another Resource Negotiator
Framework for scheduling distributed applications
Allows for any type of parallel application to run natively on Hadoop
MRv2 (MapReduce) is now just one such distributed application running on YARN
IterativeReduce
Designed specifically for parallel iterative algorithms on Hadoop
Implemented directly on top of YARN
Intrinsic Parallelism
Easier to focus on problem
Not focusing on the distributed application part
IterativeReduce API
ComputableMaster
Setup()
Compute()
Complete()
ComputableWorker
Setup()
Compute()
[Diagram: at each superstep a set of Workers computes in parallel and a Master combines their results, repeating across supersteps]
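In Java terms, the two roles look roughly like the interfaces below. This is a simplified sketch based only on the method names listed above; the real interfaces in the IterativeReduce repository carry generic type parameters and additional arguments, so consult the source for the exact signatures.

    import java.io.DataOutputStream;
    import java.util.Collection;
    import org.apache.hadoop.conf.Configuration;

    // Approximate shape of the master role: merge worker updates each superstep.
    public interface ComputableMaster<T> {
        void setup(Configuration conf);          // called once before the first superstep
        T compute(Collection<T> workerUpdates);  // fold worker results into a new global state
        void complete(DataOutputStream out);     // write out the final model when iteration ends
    }

    // Approximate shape of the worker role: one local computation per superstep.
    public interface ComputableWorker<T> {
        void setup(Configuration conf);          // called once with this worker's data split
        T compute();                             // perform a local pass, return an update for the master
    }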
SGD Master
Collects all parameter vectors at each pass / superstep
Produces new global parameter vector
By averaging workers’ vectors
Sends update to all workers
Workers replace local parameter vector with new global parameter vector
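The averaging step itself is simple. A sketch (again illustrative rather than the Metronome source) of how the master could fold the workers' parameter vectors into the new global vector:

    // Element-wise average of the workers' parameter vectors; the result is the
    // new global parameter vector that gets broadcast back to every worker.
    // Expects a non-empty java.util.List of equal-length vectors.
    public static double[] averageParameterVectors(java.util.List<double[]> workerVectors) {
        int length = workerVectors.get(0).length;
        double[] global = new double[length];
        for (double[] worker : workerVectors) {
            for (int j = 0; j < length; j++) {
                global[j] += worker[j];
            }
        }
        for (int j = 0; j < length; j++) {
            global[j] /= workerVectors.size();
        }
        return global;
    }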
SGD Worker
Each given a split of the total dataset
Similar to a map task
Performs local SGD pass
Local parameter vector sent to master at superstep
Stays active/resident between iterations
SGD: Serial vs Parallel
[Diagram: serial SGD pushes all Training Data through a single SGD process to produce one Model; in the parallel version, Split 1 through Split N go to Worker 1 through Worker N, each produces a Partial Model, and the Master combines them into the Global Model]
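Tying the earlier sketches together, the picture above can be simulated entirely in-process (in the spirit of the IRUnit simulation described below). This driver is an illustration only, reusing the hypothetical sgdPass and averageParameterVectors methods from the previous sketches; in the real system the data exchange happens through IterativeReduce on YARN.

    // Local simulation of the parallel flow: each "worker" runs an SGD pass over
    // its own split, the "master" averages the resulting vectors, and every worker
    // starts the next superstep from the averaged global vector.
    public static double[] simulateParallelSgd(double[][][] featureSplits,
                                               double[][] labelSplits,
                                               int numFeatures, int supersteps,
                                               double learningRate) {
        double[] global = new double[numFeatures];
        for (int s = 0; s < supersteps; s++) {
            java.util.List<double[]> workerVectors = new java.util.ArrayList<>();
            for (int w = 0; w < featureSplits.length; w++) {
                double[] local = global.clone();                    // start from the current global model
                sgdPass(local, featureSplits[w], labelSplits[w], learningRate);
                workerVectors.add(local);                           // "send" the local vector to the master
            }
            global = averageParameterVectors(workerVectors);        // master averages and broadcasts
        }
        return global;
    }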
Parallel Linear Regression with IterativeReduce
Based directly on work we did with Knitting Boar
Parallel logistic regression
Scales linearly with input size
Can produce a linear regression model off large amounts of data
Packaged in a new suite of parallel iterative algorithms called Metronome
100% Java, ASF 2.0 Licensed, on github
Unit Testing and IRUnit
Simulates the IterativeReduce parallel framework
Uses the same app.properties file that YARN applications do
Examples
https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java
https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java
Metronome: Parallel Linear Regression
Performance and Results
Running the Job via YARN
Build with Maven
Copy Jar to host with cluster access
Copy dataset to HDFS
Run job
yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
Results
[Chart: Linear Regression - Parallel vs Serial; Total Processing Time plotted against Total Megabytes Processed, comparing Parallel Runs and Serial Runs]
Lessons Learned
Linear scale continues to be achieved with parameter averaging variations
Tuning is critical
Need to be good at selecting a learning rate
YARN still experimental, has caveats
Container allocation is still slow
Metronome continues to be experimental
Special Thanks
Michael Katzenellenbollen
Dr. James Scott
University of Texas at Austin
Dr. Jason Baldridge
University of Texas at Austin
Future Directions
More testing, stability
Cache vectors in memory for speed
Metronome
Take on properties of LibLinear
Pluggable optimization, general linear models
YARN-centric first class Hadoop citizen
Focus on being a complement to Mahout
K-means, PageRank implementations
Github
IterativeReduce
https://github.com/emsixteeen/IterativeReduce
Metronome
https://github.com/jpatanooga/Metronome
Knitting Boar
https://github.com/jpatanooga/KnittingBoar
References
1. http://www.infoworld.com/d/business-intelligence/gartner-hadoop-will-be-in-two-thirds-of-advanced-analytics-products-2015-211475
2. https://cwiki.apache.org/MAHOUT/logistic-regression.html
3. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! (http://arxiv.org/pdf/1209.2191.pdf)