Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Linear Regression and Metronome: Modern Big Data Analytics


DESCRIPTION

Josh Patterson's Hadoop Summit EU 2013 talk on parallel linear regression with IterativeReduce and YARN.

TRANSCRIPT

Page 1: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Linear Regression and Metronome

Modern Big Data Analytics

Page 2: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Josh Patterson

Email:

[email protected]

Twitter:

@jpatanooga

Github:

https://github.com/jpatanooga

Past

Published in IAAI-09:

“TinyTermite: A Secure Routing Algorithm”

Grad work in Meta-heuristics, Ant-algorithms

Tennessee Valley Authority (TVA)

Hadoop and the Smartgrid

Cloudera

Principal Solution Architect

Today

Independent Consultant

Page 3: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Sections

1. Modern Data Analytics

2. Parallel Linear Regression

3. Performance and Results

Page 4: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

The Evolution of Data Tools

Modern Data Analytics

Page 5: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

The World as Optimization

Data tells us about our model/engine/product

We take this data and evolve our product towards a state of minimal market error

WSJ Special Section, Monday March 11, 2013

Zynga changing games based off player behavior

UPS cut fuel consumption by 8.4MM gallons

Ford used sentiment analysis to look at how new car features would be received

Page 6: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

The Modern Data Landscape

Apps are coming but they need

Platforms

Components

Workflows

Lots of investment in Hadoop in this space

Lots of ETL pipelines

Lots of descriptive Statistics

Growing interest in Machine Learning

Page 7: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Hadoop as The Linux of Data

Hadoop has won the Cycle

Gartner: Hadoop will be in two-thirds of advanced analytics products by 2015 [1]

“Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on this stage”

---Doug Cutting

Page 8: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Today’s Hadoop ML Pipeline

Data cleansing / ETL performed with Hive or Pig

Data processed in place:

Mahout

R

Custom MapReduce algorithms

Or processed externally:

SAS

SPSS

KXEN

Weka

Page 9: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

As Focus Shifts to Applications

Data rates have been climbing fast

Speed at Scale becomes the new Killer App

Companies will want to leverage the Big Data infrastructure they’ve already been working with

Hadoop

HDFS as main storage system

A drive to validate big data investments with results

Emergence of applications which create “data products”

Page 10: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Patterson’s Law

“As the percent of your total data held in a storage system approaches 100%, the amount of in-system processing and analytics also approaches 100%”

Page 11: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Tools Will Move onto Hadoop

Already seeing this with Vendors

Who hasn’t announced a SQL engine on Hadoop lately?

Trend will continue with machine learning tools

Mahout was the beginning

More are following

But what about parallel iterative algorithms?

Page 12: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Distributed Systems Are Hard

Lots of moving parts

Especially as these applications become more complicated

Machine learning can be a non-trivial operation

We need great building blocks that work well together

I agree with Jimmy Lin [3]: “keep it simple”

“make sure costs don’t outweigh benefits”

Minimize “Yet Another Tool To Learn” (YATTL) as much as we can!

Page 13: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

To Summarize

Data moving into Hadoop everywhere

Patterson’s Law

Focus on Hadoop; build around the next-gen “Linux of data”

Need simple components to build next-gen data apps

They should work cleanly with the cluster the Fortune 500 already has: Hadoop

They should also be easy to integrate into Hadoop and the Hadoop tool ecosystem

Minimize YATTL

Page 14: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Multivariate Linear Regression on Hadoop

Parallel Linear Regression

Page 15: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Linear Regression

In linear regression, data is modeled using linear predictor functions

Unknown model parameters are estimated from the data.

We use optimization techniques like Stochastic Gradient Descent (SGD) to find the coefficients of the model:

Y = (c0 * x0) + (c1 * x1) + … + (cN * xN), where x0 = 1 (the intercept term)

Page 16: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


Machine Learning and Optimization

Algorithms

(Convergent) Iterative Methods

Newton’s Method

Quasi-Newton

Gradient Descent

Heuristics

AntNet

PSO

Genetic Algorithms

Page 17: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


Stochastic Gradient Descent

Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/11

Hypothesis about data

Cost function

Update function
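For linear regression these three pieces take a standard form (reconstructed here in the notation of Ng's course; the equations are standard, not transcribed from the slide):

    h_\theta(x) = \theta^\top x
    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
    \theta_j \leftarrow \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

SGD applies the update once per training example rather than summing over the whole set, which is what keeps each pass over a data split cheap.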

Page 18: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


Stochastic Gradient Descent

Training

Simple gradient descent procedure

Loss function needs to be convex (with exceptions)

Linear Regression

Loss Function: squared error of prediction

Prediction: linear combination of coefficients and input variables

[Diagram: serial SGD, where the training data feeds SGD, which produces the model]
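As a concrete sketch, one serial SGD pass for squared-error linear regression could look like this in Java (hypothetical standalone code, not Metronome's implementation; the class and method names are mine):

    public class SgdPass {
        // Dot product of coefficients and input; x[0] is the intercept
        // term and should be set to 1.0 in the data.
        static double predict(double[] theta, double[] x) {
            double y = 0.0;
            for (int j = 0; j < theta.length; j++) {
                y += theta[j] * x[j];
            }
            return y;
        }

        // One pass over a training split: update the coefficients after
        // every example using the squared-error gradient.
        static void sgdPass(double[] theta, double[][] X, double[] y, double alpha) {
            for (int i = 0; i < X.length; i++) {
                double error = predict(theta, X[i]) - y[i];
                for (int j = 0; j < theta.length; j++) {
                    theta[j] -= alpha * error * X[i][j];
                }
            }
        }
    }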

Page 19: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


Mahout’s SGD

Currently Single Process

Multi-threaded parallel, but not cluster parallel

Runs locally, not deployed to the cluster

Tied to logistic regression implementation

Page 20: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


Current Limitations

Sequential algorithms on a single node only go so far

The “Data Deluge”

Presents algorithmic challenges when combined with large data sets

We need to design algorithms that can run in a distributed fashion

MapReduce only fits certain types of algorithms

Page 21: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


Distributed Learning Strategies

McDonald, 2010

Distributed Training Strategies for the Structured Perceptron

Langford, 2007

Vowpal Wabbit

Jeff Dean’s Work on Parallel SGD

Downpour SGD

Sandblaster

Page 22: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


MapReduce vs. Parallel Iterative

[Diagram: the MapReduce flow (Input → Map → Reduce → Output) contrasted with a parallel iterative flow, where a group of processors runs superstep 1, superstep 2, …, synchronizing between supersteps]

Page 23: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


YARN

Yet Another Resource Negotiator

Framework for scheduling distributed applications

Allows any type of parallel application to run natively on Hadoop

MRv2 is now just another distributed application
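To make "any type of parallel application" concrete: a client asks the ResourceManager to launch an application master container. A minimal sketch using the Hadoop 2.x YARN client API (the application name and AM command here are hypothetical):

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitToYarn {
        public static void main(String[] args) throws Exception {
            // Connect to the ResourceManager
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();

            // Ask for a new application id and fill in the submission context
            YarnClientApplication app = yarn.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("iterative-reduce-demo"); // hypothetical name

            // The command that launches our ApplicationMaster in its container
            ContainerLaunchContext am = Records.newRecord(ContainerLaunchContext.class);
            am.setCommands(Collections.singletonList(
                    "java MyAppMaster 1>/tmp/am.stdout 2>/tmp/am.stderr")); // hypothetical AM class
            ctx.setAMContainerSpec(am);
            ctx.setResource(Resource.newInstance(256, 1)); // 256 MB, 1 vcore for the AM

            ApplicationId id = yarn.submitApplication(ctx);
            System.out.println("Submitted application " + id);
        }
    }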

Page 24: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


IterativeReduce

Designed specifically for parallel iterative algorithms on Hadoop

Implemented directly on top of YARN

Intrinsic Parallelism

Easier to focus on the problem

Not on the distributed application plumbing

Page 25: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


IterativeReduce API

ComputableMaster

Setup()

Compute()

Complete()

ComputableWorker

Setup()

Compute()

[Diagram: at each superstep, the workers compute in parallel and synchronize through the master; the cycle repeats across supersteps]
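In Java terms, the two roles might be sketched roughly as below; the method names come from this slide, while the generic type and exact signatures are my assumptions (the real interfaces live in the IterativeReduce repo):

    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.Collection;
    import org.apache.hadoop.conf.Configuration;

    // Master: merges the workers' updates at each superstep and
    // writes out the final model when the job completes.
    interface ComputableMaster<T> {
        void setup(Configuration conf);
        T compute(Collection<T> workerUpdates); // e.g., average parameter vectors
        void complete(DataOutputStream out) throws IOException;
    }

    // Worker: runs one local pass over its data split per superstep.
    interface ComputableWorker<T> {
        void setup(Configuration conf);
        T compute(); // produce this worker's update (e.g., its parameter vector)
    }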

Page 26: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


SGD Master

Collects all parameter vectors at each pass / superstep

Produces new global parameter vector

By averaging workers’ vectors

Sends update to all workers

Workers replace local parameter vector with new global parameter vector
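A minimal sketch of that merge step, assuming plain double[] parameter vectors (a hypothetical helper, not Metronome's actual master):

    import java.util.List;

    public class ParameterAverager {
        // New global vector = element-wise mean of the workers' vectors.
        static double[] average(List<double[]> workerVectors) {
            int n = workerVectors.get(0).length;
            double[] global = new double[n];
            for (double[] w : workerVectors) {
                for (int j = 0; j < n; j++) {
                    global[j] += w[j];
                }
            }
            for (int j = 0; j < n; j++) {
                global[j] /= workerVectors.size();
            }
            return global; // broadcast back; workers overwrite their local vectors
        }
    }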

Page 27: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


SGD Worker

Each given a split of the total dataset

Similar to a map task

Performs local SGD pass

Local parameter vector sent to master at superstep

Stays active/resident between iterations
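Putting the worker's superstep together (a hypothetical sketch reusing the sgdPass method from the earlier snippet):

    // Hypothetical worker: stays resident across supersteps, like a
    // map task that never exits between iterations.
    class SgdWorker {
        double[] theta;      // local parameter vector
        double[][] xSplit;   // this worker's split of the training data
        double[] ySplit;
        double alpha = 1e-3; // learning rate; tuning it is critical (see Lessons Learned)

        // One superstep: a local SGD pass, then ship the vector to the master.
        double[] compute() {
            SgdPass.sgdPass(theta, xSplit, ySplit, alpha);
            return theta;
        }

        // The master's averaged vector replaces the local one before the next pass.
        void update(double[] globalTheta) {
            theta = globalTheta.clone();
        }
    }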

Page 28: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN


SGD: Serial vs Parallel

[Diagram: serial SGD (training data → model) vs. parallel SGD (splits 1…N are processed by workers 1…N, whose partial models are averaged by the master into the global model)]

Page 29: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Parallel Linear Regression with IterativeReduce

Based directly on work we did with Knitting Boar

Parallel logistic regression

Scales linearly with input size

Can produce a linear regression model from large amounts of data

Packaged in a new suite of parallel iterative algorithms called Metronome

100% Java, ASF 2.0 licensed, on GitHub

Page 30: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Unit Testing and IRUnit

Simulates the IterativeReduce parallel framework

Uses the same app.properties file that YARN applications do

Examples

https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java

https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java

Page 31: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Metronome: Parallel Linear Regression

Performance and Results

Page 32: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Running the Job via YARN

Build with Maven

Copy the JAR to a host with cluster access

Copy the dataset to HDFS

Run the job:

yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
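Expanded into concrete commands (the dataset name and paths here are hypothetical):

    # build the jar with Maven
    mvn package
    # stage the training data in HDFS
    hadoop fs -put train.csv /data/train.csv
    # launch on the cluster via YARN, configured by app.properties
    yarn jar target/iterativereduce-0.1-SNAPSHOT.jar app.properties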

Page 33: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Results

[Chart: Linear Regression, Parallel vs. Serial. X-axis: total megabytes processed (64 to 320); y-axis: total processing time (0 to 200). Series: parallel runs and serial runs.]

Page 34: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Lessons Learned

Linear scaling continues to be achieved with parameter-averaging variations

Tuning is critical

Need to be good at selecting a learning rate

YARN still experimental, has caveats

Container allocation is still slow

Metronome continues to be experimental

Page 35: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Special Thanks

Michael Katzenellenbogen

Dr. James Scott

University of Texas at Austin

Dr. Jason Baldridge

University of Texas at Austin

Page 36: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

Future Directions

More testing, stability

Cache vectors in memory for speed

Metronome

Take on properties of LibLinear

Pluggable optimization, general linear models

YARN-centric, first-class Hadoop citizen

Focus on being a complement to Mahout

K-means, PageRank implementations

Page 37: Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN

GitHub

IterativeReduce

https://github.com/emsixteeen/IterativeReduce

Metronome

https://github.com/jpatanooga/Metronome

Knitting Boar

https://github.com/jpatanooga/KnittingBoar