
Carnegie Mellon

A Framework for Machine Learning and Data Mining in the Cloud

Yucheng Low, Aapo Kyrola, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Joe Hellerstein

Big Data is Everywhere

72 hours of video uploaded to YouTube every minute

28 million Wikipedia pages

900 million Facebook users

6 billion Flickr photos

"… data a new class of economic asset, like currency or gold."

"… growing at 50 percent a year …"

How will we design and implement Big Learning systems?

Shift Towards Use of Parallelism in ML

GPUs, Multicore, Clusters, Clouds, Supercomputers

ML experts (often graduate students) repeatedly solve the same parallel design challenges: race conditions, distributed state, communication…

The resulting code is very specialized: difficult to maintain, extend, debug…

Avoid these problems by using high-level abstractions.

MapReduce – Map Phase

[Figure, repeated over several slides: CPUs 1–4 each independently compute a value (12.9, 42.3, 21.3, 25.8; then 24.1, 84.3, 18.4, 84.4; then 17.5, 67.5, 14.9, 34.3) over their own slice of the data.]

Embarrassingly parallel, independent computation. No communication needed.

MapReduce – Reduce Phase

[Figure: CPUs 1–2 fold the mapped values into aggregate statistics. Example: map extracts image features, then reduce computes separate statistics for attractive (A) and ugly (U) faces.]
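To make the data-parallel pattern concrete, here is a minimal Python sketch (not code from the talk; extract_feature and combine are illustrative stand-ins): the map phase runs independent per-item work with no communication, and the reduce phase folds the mapped values into an aggregate.

from functools import reduce
from multiprocessing import Pool

def extract_feature(image):
    # Map phase: embarrassingly parallel, per-item work; no communication.
    return sum(image) / len(image)   # stand-in for a real feature extractor

def combine(a, b):
    # Reduce phase: associative aggregation of the mapped values.
    return a + b

if __name__ == "__main__":
    images = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
    with Pool(4) as pool:                              # one worker per CPU
        features = pool.map(extract_feature, images)   # map phase
    print(reduce(combine, features))                   # reduce phase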

MapReduce for Data-Parallel ML

Excellent for large data-parallel tasks!

Data-Parallel (MapReduce):
Cross Validation
Feature Extraction
Computing Sufficient Statistics

Graph-Parallel:
Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
Semi-Supervised Learning: Label Propagation, CoEM
Graph Analysis: PageRank, Triangle Counting
Collaborative Filtering: Tensor Factorization

Is there more to Machine Learning?

Exploit Dependencies

[Figure: a social network where a user's unknown interest is inferred from neighbors labeled Hockey, Scuba Diving, and Underwater Hockey.]

Graphs are Everywhere

Collaborative Filtering: Netflix (Users × Movies)

Text Analysis: Wiki (Docs × Words)

Probabilistic Analysis: Social Networks

Properties of Computation on Graphs

Dependency Graph: my interests depend on my friends' interests.

Local Updates

Iterative Computation

ML Tasks Beyond Data-Parallelism

Data-Parallel (MapReduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics.

Graph-Parallel: Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.), Semi-Supervised Learning (Label Propagation, CoEM), Graph Analysis (PageRank, Triangle Counting), Collaborative Filtering (Tensor Factorization).

Graph-parallel algorithms: Bayesian Tensor Factorization, Gibbs Sampling, Matrix Factorization, Lasso, SVM, Belief Propagation, PageRank, CoEM, SVD, LDA, Linear Solvers, Splash Sampler, Alternating Least Squares, … many others …

2010: Shared Memory

Limited CPU power, limited memory, limited scalability.

Distributed Cloud

Unlimited amount of computation resources! (up to funding limitations)

New challenges: distributing state, data consistency, fault tolerance.

The GraphLab Framework

Graph-Based Data Representation

Update Functions (User Computation)

Consistency Model

Data Graph

Data is associated with vertices and edges.

Graph: social network
Vertex data: user profile, current interest estimates
Edge data: relationship (friend, classmate, relative)
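As a rough illustration of the data-graph idea, here is a small Python sketch (the class and field names are hypothetical; GraphLab itself is a C++ framework): user data hangs off the vertices and edges of an undirected graph.

class DataGraph:
    def __init__(self):
        self.vertex_data = {}   # vertex id -> user data
        self.edge_data = {}     # (src, dst) -> edge data
        self.neighbors = {}     # vertex id -> set of adjacent ids

    def add_vertex(self, vid, data):
        self.vertex_data[vid] = data
        self.neighbors.setdefault(vid, set())

    def add_edge(self, src, dst, data):
        self.edge_data[(src, dst)] = data
        self.neighbors[src].add(dst)
        self.neighbors[dst].add(src)

g = DataGraph()
g.add_vertex("alice", {"profile": "...", "interests": [0.5, 0.5]})
g.add_vertex("bob",   {"profile": "...", "interests": [0.9, 0.1]})
g.add_edge("alice", "bob", {"relationship": "friend"})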

Distributed Graph

Partition the graph across multiple machines.

"Ghost" vertices maintain the adjacency structure and replicate remote data.

Cut efficiently using HPC graph-partitioning tools (ParMetis / Scotch / …).
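A minimal sketch of how ghosting might look, assuming a two-machine edge cut produced by a partitioner such as ParMetis: each machine records read-only ghost copies of the remote endpoints of cut edges.

edges = [("a", "b"), ("b", "c"), ("c", "d")]
owner = {"a": 0, "b": 0, "c": 1, "d": 1}   # partition from, e.g., ParMetis

ghosts = {0: set(), 1: set()}              # machine -> replicated remote ids
for src, dst in edges:
    if owner[src] != owner[dst]:           # edge crosses the cut
        ghosts[owner[src]].add(dst)        # src's machine ghosts dst
        ghosts[owner[dst]].add(src)        # dst's machine ghosts src

print(ghosts)   # {0: {'c'}, 1: {'b'}}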

Update Functions (User Computation)

Update Functions

A user-defined program, applied to a vertex, that transforms the data in the scope of the vertex:

Pagerank(scope) {
  // Update the current vertex data
  vertex.PageRank = ALPHA
  foreach (inpage in scope.in_neighbors())
    vertex.PageRank += (1 - ALPHA) * inpage.PageRank

  // Reschedule neighbors if needed
  if vertex.PageRank changes then
    reschedule_all_neighbors
}

Dynamic computation: update functions are applied (asynchronously) in parallel until convergence. Many schedulers are available to prioritize computation.

Shared-Memory Dynamic Schedule

[Figure: a shared scheduler holds pending vertex tasks (a, b, h, i, …); CPU 1 and CPU 2 repeatedly pull vertices from it, run the update function, and push any rescheduled neighbors back.]

The process repeats until the scheduler is empty.
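The following toy Python sketch shows the dynamic-scheduling loop for the PageRank update above, run sequentially for clarity (GraphLab executes many such updates in parallel); the graph and tolerance are made up, and the update uses the standard out-degree-normalized PageRank.

from collections import deque

ALPHA, TOL = 0.15, 1e-4
in_nbrs = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}     # toy web graph
out_nbrs = {v: [u for u in in_nbrs if v in in_nbrs[u]] for v in in_nbrs}
rank = {v: 1.0 for v in in_nbrs}

scheduler = deque(in_nbrs)               # initially schedule every vertex
scheduled = set(scheduler)

while scheduler:                          # repeats until scheduler is empty
    v = scheduler.popleft()
    scheduled.discard(v)
    new_rank = ALPHA + (1 - ALPHA) * sum(
        rank[u] / len(out_nbrs[u]) for u in in_nbrs[v])
    if abs(new_rank - rank[v]) > TOL:     # changed: reschedule neighbors
        for u in out_nbrs[v]:
            if u not in scheduled:
                scheduled.add(u)
                scheduler.append(u)
    rank[v] = new_rank

print(rank)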

Distributed Scheduling

[Figure: the graph is partitioned across machines, and each machine's local scheduler holds pending tasks for its own vertices.]

Each machine maintains a schedule over the vertices it owns. Distributed consensus is used to identify completion.

Ensuring Race-Free Code

How much can computation overlap?

Consistency Model

Racing Collaborative Filtering

[Plot: RMSE (log scale, 0.01–1) vs. time (0–1600 s), comparing "d=20" with "Racing d=20"; the race-prone run converges to a noticeably worse RMSE.]

Serializability

For every parallel execution, there exists a sequential execution of update functions which produces the same result.

[Figure: updates interleaved across CPU 1 and CPU 2 in the parallel execution correspond to some ordering of the same updates on a single CPU.]

Serializability Example

[Figure: each update writes its center vertex and adjacent edges, and only reads the adjacent vertices.]

Edge Consistency: update functions one vertex apart can be run in parallel, because their overlapping regions are only read.

Stronger and weaker consistency levels are available; the user-tunable consistency level trades off parallelism and consistency.

Distributed Consistency

Solution 1: Graph Coloring
Solution 2: Distributed Locking

Edge Consistency via Graph Coloring

Vertices of the same color are all at least one vertex apart; therefore, all vertices of the same color can be run in parallel!
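A single-process sketch of the idea, assuming a greedy coloring: vertices of one color form an independent set, so their edge-consistent updates can run in one parallel batch, with a barrier between colors.

from concurrent.futures import ThreadPoolExecutor

nbrs = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

color = {}
for v in nbrs:                                     # greedy coloring
    used = {color[u] for u in nbrs[v] if u in color}
    color[v] = next(c for c in range(len(nbrs)) if c not in used)

def update(v):                                     # stand-in update function
    print(f"update {v} (color {color[v]})")

with ThreadPoolExecutor() as pool:
    for c in sorted(set(color.values())):          # one step per color
        batch = [v for v in nbrs if color[v] == c]
        list(pool.map(update, batch))              # parallel within a color
        # distributed engine: ghost synchronization + barrier would go here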

Chromatic Distributed Engine

[Figure: timeline per machine. Every machine executes tasks on all of its vertices of color 0, followed by ghost synchronization completion and a barrier; then all vertices of color 1, another synchronization and barrier; and so on.]

Matrix Factorization: Netflix Collaborative Filtering

Alternating Least Squares (ALS) matrix factorization.

Model: 0.5 million nodes, 99 million edges.

[Figure: bipartite Users × Movies graph; each user and movie carries a latent factor vector of dimension d.]
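For reference, one ALS loop on a tiny fully observed ratings matrix looks like the following NumPy sketch (GraphLab expresses the same computation as per-vertex updates on the bipartite graph; the dimensions and regularizer here are made up):

import numpy as np

ratings = np.array([[5.0, 3.0], [4.0, 1.0]])   # 2 users x 2 movies
d, lam = 2, 0.1                                 # latent dim, ridge term
rng = np.random.default_rng(0)
U = rng.random((2, d))                          # user factor vectors
M = rng.random((2, d))                          # movie factor vectors

for _ in range(20):
    # Fix M, solve the regularized least squares for all user vectors.
    U = np.linalg.solve(M.T @ M + lam * np.eye(d), M.T @ ratings.T).T
    # Fix U, solve the regularized least squares for all movie vectors.
    M = np.linalg.solve(U.T @ U + lam * np.eye(d), U.T @ ratings).T

print(U @ M.T)   # approximately reconstructs the ratings matrix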

Netflix Collaborative Filtering: Scaling

[Plots: speedup vs. # machines for D=100 and D=20 against the ideal line; and runtime vs. # machines for Hadoop, MPI, and GraphLab, illustrating the cost of Hadoop.]

Problems

Requires a graph coloring to be available.

Frequent barriers make it extremely inefficient for highly dynamic systems where only a small number of vertices are active in each round.

Distributed Consistency, Solution 2: Distributed Locking

Distributed Locking

Edge consistency can be guaranteed through locking: each vertex carries a reader-writer (RW) lock.

Consistency Through Locking

Acquire a write-lock on the center vertex and read-locks on the adjacent vertices.
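A minimal sketch of edge consistency through locking, with a toy reader-writer lock and locks acquired in a global sorted order to avoid deadlock (the RWLock class and graph are illustrative, not GraphLab's implementation):

import threading

class RWLock:
    def __init__(self):
        self._readers = 0
        self._mutex = threading.Lock()     # protects _readers
        self._writer = threading.Lock()    # held while writing

    def acquire_read(self):
        with self._mutex:
            self._readers += 1
            if self._readers == 1:
                self._writer.acquire()     # first reader blocks writers

    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._writer.release()

    def acquire_write(self):
        self._writer.acquire()

    def release_write(self):
        self._writer.release()

locks = {v: RWLock() for v in ["a", "b", "c"]}
nbrs = {"a": ["b", "c"]}

def edge_consistent_update(center):
    # Lock the scope in a global (sorted) order: write on the center
    # vertex, read on its neighbors.
    for v in sorted([center] + nbrs[center]):
        if v == center:
            locks[v].acquire_write()
        else:
            locks[v].acquire_read()
    try:
        pass  # ... run the update function on center's scope ...
    finally:
        for v in [center] + nbrs[center]:
            (locks[v].release_write if v == center
             else locks[v].release_read)()

edge_consistent_update("a")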

Consistency Through Locking

Multicore setting: PThread RW-locks.

Distributed setting: distributed locks. Challenge: latency. Solution: pipelining.

[Figure: vertices A–D partitioned across Machine 1 and Machine 2; lock requests for remote vertices incur a network round trip.]

No Pipelining

[Timeline: lock scope 1 → process request 1 → scope 1 acquired → update_function 1 → release scope 1 → process release 1. Each step waits for the previous round trip to finish.]

Pipelining / Latency Hiding

Hide latency using pipelining: lock requests for scopes 2 and 3 are issued while scope 1 is still in flight, and each update function runs as soon as its scope is acquired.

[Timeline: lock scope 1, 2, 3 requested back to back; update_function 1 runs once scope 1 is acquired and released, update_function 2 once scope 2 is acquired, and so on.]
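The latency-hiding idea can be sketched with futures: issue several scope-lock requests at once and run each update as its scope is acquired, instead of waiting out one round trip at a time (the sleep stands in for network latency; all names are hypothetical):

from concurrent.futures import ThreadPoolExecutor
import time

def acquire_scope(vid):
    time.sleep(0.1)               # simulated network round trip for a lock
    return vid

def update_function(vid):
    print(f"update_function {vid}")

vertices = [1, 2, 3, 4]
with ThreadPoolExecutor(max_workers=4) as pool:
    # issue every lock request up front, then consume acquisitions
    pending = [pool.submit(acquire_scope, v) for v in vertices]
    for fut in pending:
        update_function(fut.result())   # total wait ~0.1 s, not ~0.4 s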

Fault Tolerance

What if machines fail? How do we provide fault tolerance?

Checkpoint:
1. Stop the world.
2. Write state to disk.

Snapshot Performance

[Plot: progress over time for "No Snapshot", "Snapshot", and "Snapshot with one slow machine"; the synchronous snapshot visibly stalls progress, and the slow machine stretches the snapshot time further.]

Because we have to stop the world, one slow machine slows everything down!

How can we do better? Take advantage of consistency.

Checkpointing

1985: Chandy-Lamport invented an asynchronous snapshotting algorithm for distributed systems.

[Figure: a wavefront separates the snapshotted part of the graph from the not-yet-snapshotted part.]

Fine-grained Chandy-Lamport, easily implemented within GraphLab as an update function!
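A toy sketch of the snapshot-as-an-update-function idea (the graph and scheduler plumbing are hypothetical): an unsnapshotted vertex saves its state and schedules its unsnapshotted neighbors, so the snapshot spreads as a wavefront like any other update.

from collections import deque

nbrs = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
data = {"a": 1.0, "b": 2.0, "c": 3.0}

snapshotted = set()
saved = {}
scheduler = deque(["a"])              # snapshot starts at any vertex

def snapshot_update(v):
    if v in snapshotted:
        return
    saved[v] = data[v]                # persist this vertex's state
    snapshotted.add(v)
    for u in nbrs[v]:                 # propagate the snapshot wavefront
        if u not in snapshotted:
            scheduler.append(u)

while scheduler:                       # runs like any other update function
    snapshot_update(scheduler.popleft())

print(saved)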

Async. Snapshot Performance

[Plot: with the asynchronous snapshot, the "one slow machine" curve tracks the "No Snapshot" curve.]

No penalty incurred by the slow machine!

Summary

Extended the GraphLab abstraction to distributed systems.

Two different methods of achieving consistency:
Graph Coloring
Distributed Locking with pipelining

Efficient implementations; asynchronous fault tolerance with fine-grained Chandy-Lamport.

Performance, usability, efficiency, scalability.