Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
DESCRIPTION
Josh Patterson's Hadoop Summit EU 2013 talk on parallel linear regression on IterativeReduce and YARN.
TRANSCRIPT
Linear Regression and Metronome
Modern Big Data Analytics
Josh Patterson
Email:
Twitter:
@jpatanooga
Github:
https://github.com/jpatanooga
Past
Published in IAAI-09:
“TinyTermite: A Secure Routing Algorithm”
Grad work in Meta-heuristics, Ant-algorithms
Tennessee Valley Authority (TVA)
Hadoop and the Smartgrid
Cloudera
Principal Solution Architect
Today
Independent Consultant
Sections
1. Modern Data Analytics
2. Parallel Linear Regression
3. Performance and Results
The Evolution of Data Tools
Modern Data Analytics
The World as Optimization
Data tells us about our model/engine/product
We take this data and evolve our product towards a state of minimal market error
WSJ Special Section, Monday March 11, 2013
Zynga changing games based off player behavior
UPS cut fuel consumption by 8.4MM gallons
Ford used sentiment analysis to look at how new car features would be received
The Modern Data Landscape
Apps are coming but they need
Platforms
Components
Workflows
Lots of investment in Hadoop in this space
Lots of ETL pipelines
Lots of descriptive Statistics
Growing interest in Machine Learning
Hadoop as The Linux of Data
Hadoop has won the Cycle
Gartner: Hadoop will be in 2/3s of advanced analytics products by 2015 [1]
“Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on this stage”
---Doug Cutting
Today’s Hadoop ML Pipeline
Data cleansing / ETL performed with Hive or Pig
Data Processed In Place:
Mahout
R
Custom MapReduce Algorithm
Or Processed Externally:
SAS
SPSS
KXEN
Weka
As Focus Shifts to Applications
Data rates have been climbing fast
Speed at Scale becomes the new Killer App
Companies will want to leverage the Big Data infrastructure they’ve already been working with
Hadoop
HDFS as main storage system
A drive to validate big data investments with results
Emergence of applications which create “data products”
Patterson’s Law
“As the percent of your total data held in a storage system approaches 100% the amount of in-system processing and analytics also approaches 100%”
Tools Will Move onto Hadoop
Already seeing this with Vendors
Who hasn’t announced a SQL engine on Hadoop lately?
Trend will continue with machine learning tools
Mahout was the beginning
More are following
But what about parallel iterative algorithms?
Distributed Systems Are Hard
Lots of moving parts
Especially as these applications become more complicated
Machine learning can be a non-trivial operation
We need great building blocks that work well together
I agree with Jimmy Lin [3]: “keep it simple”
“make sure costs don’t outweigh benefits”
Minimize “Yet Another Tool To Learn” (YATTL) as much as we can!
To Summarize
Data moving into Hadoop everywhere
Patterson’s Law
Focus on Hadoop, build around the next-gen “Linux of data”
Need simple components to build next-gen data apps
They should work cleanly with the cluster that the Fortune 500 has: Hadoop
Also should be easy to integrate into Hadoop and the Hadoop tool ecosystem
Minimize YATTL
Multivariate Linear Regression on Hadoop
Parallel Linear Regression
Linear Regression
In linear regression, data is modeled using linear predictor functions
unknown model parameters are estimated from the data.
We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model
Y = (1*x0) + (c1*x1) + … + (cN*xN)
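To make the model concrete, here is a minimal Java sketch (an illustration added for this write-up, not code from the talk or from Metronome) of the linear predictor: the output is just the dot product of the coefficient vector and the feature vector, with a constant 1.0 feature standing in for the intercept.

    // Minimal sketch of the linear predictor Y = c0*x0 + c1*x1 + ... + cN*xN.
    // Hypothetical helper class, not part of Metronome's actual API.
    public class LinearModel {
        private final double[] coefficients; // c0..cN, where c0 multiplies the constant term

        public LinearModel(double[] coefficients) {
            this.coefficients = coefficients;
        }

        // features[0] is expected to be 1.0, so coefficients[0] acts as the intercept
        public double predict(double[] features) {
            double y = 0.0;
            for (int i = 0; i < coefficients.length; i++) {
                y += coefficients[i] * features[i];
            }
            return y;
        }
    }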
Machine Learning and Optimization
Algorithms
(Convergent) Iterative Methods
Newton’s Method
Quasi-Newton
Gradient Descent
Heuristics
AntNet
PSO
Genetic Algorithms
Stochastic Gradient Descent
Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/11
Hypothesis about data
Cost function
Update function
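The slide text does not reproduce the equations themselves; for reference, the standard forms from the linked tutorial, specialized to linear regression, are (written here in LaTeX):

    h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n
    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
    \theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{(update applied per training example } i\text{)}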
Stochastic Gradient Descent
Training
Simple gradient descent procedure
Loss function needs to be convex (with exceptions)
Linear Regression
Loss Function: squared error of prediction
Prediction: linear combination of coefficients and input variables
[Diagram: Training Data → SGD → Model]
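A minimal Java sketch of the per-example update described above (illustrative only, not the Metronome implementation), using squared-error loss: compute the prediction for one example, take the error, and move each coefficient a small step against the gradient.

    // One stochastic gradient descent pass over a set of examples, for linear
    // regression with squared-error loss. Sketch only; assume it lives in the
    // same utility class as the LinearModel example above.
    public static void sgdPass(double[] coefficients, double[][] features,
                               double[] labels, double learningRate) {
        for (int i = 0; i < features.length; i++) {
            // prediction = dot(coefficients, features[i]); features[i][0] == 1.0 for the intercept
            double prediction = 0.0;
            for (int j = 0; j < coefficients.length; j++) {
                prediction += coefficients[j] * features[i][j];
            }
            double error = prediction - labels[i];
            // the gradient of 0.5 * error^2 w.r.t. coefficient j is error * features[i][j]
            for (int j = 0; j < coefficients.length; j++) {
                coefficients[j] -= learningRate * error * features[i][j];
            }
        }
    }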
Mahout’s SGD
Currently Single Process
Multi-threaded parallel, but not cluster parallel
Runs locally, not deployed to the cluster
Tied to logistic regression implementation
Current Limitations
Sequential algorithms on a single node only go so far
The “Data Deluge”
Presents algorithmic challenges when combined with large data sets
Need to design algorithms that are able to perform in a distributed fashion
MapReduce only fits certain types of algorithms
Distributed Learning Strategies
McDonald, 2010
Distributed Training Strategies for the Structured Perceptron
Langford, 2007
Vowpal Wabbit
Jeff Dean’s Work on Parallel SGD
DownPour SGD
Sandblaster
MapReduce vs. Parallel Iterative
[Diagram: the MapReduce flow runs Input through Map tasks and Reduce tasks to produce Output; the parallel iterative flow keeps a fixed set of processors that advance together through Superstep 1, Superstep 2, and so on]
YARN
Yet Another Resource Negotiator
Framework for scheduling distributed applications
Allows for any type of parallel application to run natively on Hadoop
MRv2 (MapReduce) is now just one such distributed application running on YARN
IterativeReduce
Designed specifically for parallel iterative algorithms on Hadoop
Implemented directly on top of YARN
Intrinsic Parallelism
Easier to focus on problem
Not focusing on the distributed application part
IterativeReduce API
ComputableMaster
Setup()
Compute()
Complete()
ComputableWorker
Setup()
Compute()
[Diagram: at each superstep a set of Workers computes in parallel and a Master combines their results, repeating across supersteps]
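In Java terms, the two roles look roughly like the interfaces below. This is a simplified sketch based only on the method names listed above; the real interfaces in the IterativeReduce repository carry generic type parameters and additional arguments, so consult the source for the exact signatures.

    import java.io.DataOutputStream;
    import java.util.Collection;
    import org.apache.hadoop.conf.Configuration;

    // Approximate shape of the master role: merge worker updates each superstep.
    public interface ComputableMaster<T> {
        void setup(Configuration conf);          // called once before the first superstep
        T compute(Collection<T> workerUpdates);  // fold worker results into a new global state
        void complete(DataOutputStream out);     // write out the final model when iteration ends
    }

    // Approximate shape of the worker role: one local computation per superstep.
    public interface ComputableWorker<T> {
        void setup(Configuration conf);          // called once with this worker's data split
        T compute();                             // perform a local pass, return an update for the master
    }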
SGD Master
Collects all parameter vectors at each pass / superstep
Produces new global parameter vector
By averaging workers’ vectors
Sends update to all workers
Workers replace local parameter vector with new global parameter vector
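The averaging step itself is simple. A sketch (again illustrative rather than the Metronome source) of how the master could fold the workers' parameter vectors into the new global vector:

    // Element-wise average of the workers' parameter vectors; the result is the
    // new global parameter vector that gets broadcast back to every worker.
    // Expects a non-empty java.util.List of equal-length vectors.
    public static double[] averageParameterVectors(java.util.List<double[]> workerVectors) {
        int length = workerVectors.get(0).length;
        double[] global = new double[length];
        for (double[] worker : workerVectors) {
            for (int j = 0; j < length; j++) {
                global[j] += worker[j];
            }
        }
        for (int j = 0; j < length; j++) {
            global[j] /= workerVectors.size();
        }
        return global;
    }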
SGD Worker
Each given a split of the total dataset
Similar to a map task
Performs local SGD pass
Local parameter vector sent to master at superstep
Stays active/resident between iterations
SGD: Serial vs Parallel
[Diagram: serial SGD pushes all Training Data through a single SGD process to produce one Model; in the parallel version, Split 1 through Split N go to Worker 1 through Worker N, each produces a Partial Model, and the Master combines them into the Global Model]
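Tying the earlier sketches together, the picture above can be simulated entirely in-process (in the spirit of the IRUnit simulation described below). This driver is an illustration only, reusing the hypothetical sgdPass and averageParameterVectors methods from the previous sketches; in the real system the data exchange happens through IterativeReduce on YARN.

    // Local simulation of the parallel flow: each "worker" runs an SGD pass over
    // its own split, the "master" averages the resulting vectors, and every worker
    // starts the next superstep from the averaged global vector.
    public static double[] simulateParallelSgd(double[][][] featureSplits,
                                               double[][] labelSplits,
                                               int numFeatures, int supersteps,
                                               double learningRate) {
        double[] global = new double[numFeatures];
        for (int s = 0; s < supersteps; s++) {
            java.util.List<double[]> workerVectors = new java.util.ArrayList<>();
            for (int w = 0; w < featureSplits.length; w++) {
                double[] local = global.clone();                    // start from the current global model
                sgdPass(local, featureSplits[w], labelSplits[w], learningRate);
                workerVectors.add(local);                           // "send" the local vector to the master
            }
            global = averageParameterVectors(workerVectors);        // master averages and broadcasts
        }
        return global;
    }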
Parallel Linear Regression with IterativeReduce
Based directly on work we did with Knitting Boar
Parallel logistic regression
Scales linearly with input size
Can produce a linear regression model off large amounts of data
Packaged in a new suite of parallel iterative algorithms called Metronome
100% Java, ASF 2.0 Licensed, on github
Unit Testing and IRUnit
Simulates the IterativeReduce parallel framework
Uses the same app.properties file that YARN applications do
Examples
https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java
https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java
Metronome: Parallel Linear Regression
Performance and Results
Running the Job via YARN
Build with Maven
Copy Jar to host with cluster access
Copy dataset to HDFS
Run job
yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
Results
[Chart: Linear Regression - Parallel vs Serial; Total Processing Time plotted against Total Megabytes Processed, comparing Parallel Runs and Serial Runs]
Lessons Learned
Linear scale continues to be achieved with parameter averaging variations
Tuning is critical
Need to be good at selecting a learning rate
YARN still experimental, has caveats
Container allocation is still slow
Metronome continues to be experimental
Special Thanks
Michael Katzenellenbollen
Dr. James Scott
University of Texas at Austin
Dr. Jason Baldridge
University of Texas at Austin
Future Directions
More testing, stability
Cache vectors in memory for speed
Metronome
Take on properties of LibLinear
Pluggable optimization, general linear models
YARN-centric first class Hadoop citizen
Focus on being a complement to Mahout
K-means, PageRank implementations
Github
IterativeReduce
https://github.com/emsixteeen/IterativeReduce
Metronome
https://github.com/jpatanooga/Metronome
Knitting Boar
https://github.com/jpatanooga/KnittingBoar
References
1. http://www.infoworld.com/d/business-intelligence/gartner-hadoop-will-be-in-two-thirds-of-advanced-analytics-products-2015-211475
2. https://cwiki.apache.org/MAHOUT/logistic-regression.html
3. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! (http://arxiv.org/pdf/1209.2191.pdf)