
Carnegie Mellon University

Parallel Machine Learning for Large-Scale Graphs

Danny Bickson

The GraphLab Team: Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Carlos Guestrin, Joe Hellerstein, Jay Gu, Alex Smola

Parallelism is Difficult

Wide array of different parallel architectures: GPUs, multicore, clusters, clouds, supercomputers

Different challenges for each architecture

High-level abstractions make things easier

How will we design and implement parallel learning systems?

... a popular answer:

Map-Reduce / Hadoop: build learning algorithms on top of high-level parallel abstractions

Map-Reduce for Data-Parallel ML

Excellent for large data-parallel tasks!

Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics

Graph-Parallel: Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso

Example of Graph Parallelism

PageRank Example

Iterate:

R[i] = α + (1 − α) Σ_{j ∈ N(i)} W_ji R[j]

where α is the random reset probability and L[j] is the number of links on page j (so that W_ji = 1/L[j]).

[Figure: a small example web graph over pages 1-6.]
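To make the iteration concrete, here is a minimal sequential C++ sketch of the update above on a toy three-page graph. This is our illustration, not GraphLab code; the container names and the fixed 30 iterations are assumptions.

#include <algorithm>
#include <cstdio>
#include <vector>

// Minimal sequential PageRank sketch (illustrative only, not the GraphLab API).
// out_links[j] lists the pages that page j links to, so L[j] = out_links[j].size().
int main() {
    const double alpha = 0.15;  // random reset probability (assumed value)
    std::vector<std::vector<int>> out_links = {{1, 2}, {2}, {0}};
    const int n = static_cast<int>(out_links.size());
    std::vector<double> rank(n, 1.0), next(n);

    for (int iter = 0; iter < 30; ++iter) {
        std::fill(next.begin(), next.end(), alpha);
        for (int j = 0; j < n; ++j)          // page j spreads R[j] / L[j]
            for (int i : out_links[j])       // to every page i it links to
                next[i] += (1 - alpha) * rank[j] / out_links[j].size();
        rank.swap(next);
    }
    for (int i = 0; i < n; ++i) std::printf("R[%d] = %.4f\n", i, rank[i]);
}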

Properties of Graph-Parallel Algorithms: a dependency graph, local updates (my rank depends on my friends' ranks), and iterative computation.

Addressing Graph-Parallel ML

We need alternatives to Map-Reduce.

Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics

Graph-Parallel (Map Reduce? Pregel (Giraph)?): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso

Pregel (Giraph)

Bulk Synchronous Parallel model: compute, communicate, barrier.

Problem: bulk synchronous computation can be highly inefficient.

BSP Systems Problem: Curse of the Slow Job

[Figure: CPUs 1-3 each process their data in lockstep; every iteration ends in a barrier, so all CPUs wait for the slowest job before the next iteration can start.]

The Need for a New Abstraction

If not Pregel, then what?

Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics

Graph-Parallel (Pregel (Giraph)): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso

What is GraphLab?

The GraphLab Solution

Designed specifically for ML needs: expresses data dependencies; iterative.

Simplifies the design of parallel programs: abstracts away hardware issues; automatic data synchronization; addresses multiple hardware architectures (multicore, distributed, cloud computing; GPU implementation in progress).

The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model

Data Graph

A graph with arbitrary data (C++ objects) associated with each vertex and edge.

Graph: social network. Vertex data: user profile text, current interest estimates. Edge data: similarity weights.
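As a sketch of what "arbitrary data (C++ objects)" might look like for this social-network example; these struct definitions are our illustration, not GraphLab's actual type declarations.

#include <string>
#include <vector>

// Illustrative vertex data: what the data graph would store at each vertex.
struct vertex_data {
    std::string profile_text;          // user profile text
    std::vector<double> interests;     // current interest estimates
};

// Illustrative edge data: what the data graph would store on each edge.
struct edge_data {
    double similarity;                 // similarity weight between two users
};

// The data graph then pairs the social-network structure with these objects,
// conceptually: graph<vertex_data, edge_data>.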

Update Functions

An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

pagerank(i, scope) {
  // Get neighborhood data (R[i], W_ij, R[j]) from scope

  // Update the vertex data
  R[i] = α + (1 − α) Σ_{j ∈ N(i)} W_ji R[j];

  // Reschedule neighbors if needed
  if R[i] changes then
    reschedule_neighbors_of(i);
}

Dynamic Computation: The Scheduler

The scheduler determines the order in which vertices are updated; see the sketch after this section.

[Figure: CPU 1 and CPU 2 pull vertices (a, b, c, ...) from the scheduler queue and apply update functions, which may push further vertices onto the queue.]

The process repeats until the scheduler is empty.
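Schematically, one worker's interaction with the scheduler can be sketched as below. This is a single-threaded simplification with hypothetical names (run_scheduler, update), not GraphLab's actual scheduler implementation, which runs many such loops in parallel under the consistency model.

#include <functional>
#include <queue>

using vertex_id = int;

// One worker loop: pull a vertex, apply the user's update function, repeat.
// The update function may push (reschedule) neighbors onto the scheduler.
void run_scheduler(
    std::queue<vertex_id>& scheduler,
    const std::function<void(vertex_id, std::queue<vertex_id>&)>& update) {
    while (!scheduler.empty()) {       // stops when the scheduler is empty
        vertex_id v = scheduler.front();
        scheduler.pop();
        update(v, scheduler);          // may reschedule neighbors of v
    }
}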

The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model

Ensuring Race-Free Code

How much can computation overlap?

Need for Consistency?

No consistency gives higher throughput (#updates/sec), but potentially slower convergence of ML.

[Figure: train RMSE vs. number of updates for ALS on Netflix data, 8 cores; dynamic consistent updates reach low RMSE with far fewer updates than dynamic inconsistent updates.]

Even Simple PageRank Can Be Dangerous

GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / nbr.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  ...

Inconsistent PageRank: a read-write race. CPU 1 reads a bad PageRank estimate while CPU 2 computes the value.

Race Condition Can Be Very Subtle

Unstable (sum is a reference to the center value, so partial sums are visible to other CPUs):

GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  ...

Stable (sum is a local accumulator; the center value is written once at the end):

GraphLab_pagerank(scope) {
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / nbr.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  scope.center_value = sum
  ...

This was actually encountered in user code.
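To make the difference concrete, here is a condensed C++ sketch of the two patterns. This is our illustration with hypothetical names; std::atomic<double> stands in for the shared center value so that individual reads and writes are well defined.

#include <atomic>
#include <vector>

std::atomic<double> center_value{1.0};  // shared vertex value

// Unstable: the accumulator IS the shared location, so every intermediate
// partial sum becomes visible to concurrently reading CPUs.
void unstable_update(const std::vector<double>& nbr_contrib) {
    center_value = 0;                         // bad: visible partial state
    for (double c : nbr_contrib)
        center_value = center_value + c;      // readers may see partial sums
}

// Stable: accumulate into a local, publish the final value with one write.
void stable_update(const std::vector<double>& nbr_contrib) {
    double sum = 0;
    for (double c : nbr_contrib) sum += c;
    center_value = sum;                       // single visible write
}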

GraphLab Ensures Sequential Consistency

For each parallel execution, there exists a sequential execution of update functions which produces the same result.

[Figure: a parallel execution on CPU 1 and CPU 2 over time is equivalent to some sequential execution on a single CPU.]

Consistency Rules

Full consistency: guarantees sequential consistency for all update functions.

Obtaining More Parallelism

Edge consistency: adjacent scopes may overlap on reads, so CPU 1 and CPU 2 can safely run simultaneously as long as the overlapping data is only read.

The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model


What algorithms are implemented in GraphLab?

Bayesian Tensor Factorization, Gibbs Sampling, Dynamic Block Gibbs Sampling, Matrix Factorization, Lasso, SVM, Belief Propagation, PageRank, CoEM, K-Means, SVD, LDA, Linear Solvers, Splash Sampler, Alternating Least Squares, ...many others...

GraphLab Libraries

Matrix factorization: SVD, PMF, BPTF, ALS, NMF, Sparse ALS, Weighted ALS, SVD++, time-SVD++, SGD

Linear solvers: Jacobi, GaBP, Shotgun Lasso, sparse logistic regression, CG

Clustering: K-means, fuzzy K-means, LDA, K-core decomposition

Inference: discrete BP, NBP, kernel BP


Efficient Multicore Collaborative Filtering

LeBuSiShu team – 5th place in track 1

Institute of Automation, Chinese Academy of Sciences; Machine Learning Dept., Carnegie Mellon University

ACM KDD CUP Workshop 2011

Yao Wu, Qiang Yan, Danny Bickson, Yucheng Low, Qing Yang

ACM KDD CUP 2011

• Task: predict music scores

• Two main challenges: data magnitude (260M ratings) and the taxonomy of the data

Data taxonomy

Our approach

• Use an ensemble method

• Custom SGD algorithm for handling the taxonomy

Ensemble method

• Solutions are merged using linear regression; a sketch follows.
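As a sketch of what "merged using linear regression" amounts to: stack each predictor's validation predictions into a column of a matrix and solve ordinary least squares for the blend weights (notation is ours, not the team's):

\text{Let } P \in \mathbb{R}^{n \times k} \text{ hold the } k \text{ predictors' validation predictions and } y \text{ the true ratings:}
\qquad \hat{w} = \arg\min_{w} \lVert y - P w \rVert_2^2 = (P^\top P)^{-1} P^\top y .

The blended prediction for a new rating is then the weighted sum \sum_i \hat{w}_i p_i of the individual predictions.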

Performance results

Blended Validation RMSE: 19.90

Classical Matrix Factorization

[Figure: sparse users × items rating matrix factorized into rank-d user and item factors.]

MFITR

[Figure: sparse users × items matrix where each item's rank-d factor is an "effective feature of an item", combining features of the artist, features of the album, and item-specific features.]

Intuitively, features of an artist and features of his/her album should be "similar". How do we express this?

Artist / Album / Track hierarchy:

• Penalty terms ensure that artist/album/track features are "close"

• The strength of the penalty depends on "normalized rating similarity" (see neighborhood model)

Fine Tuning Challenge

The dataset has around 260M observed ratings; 12 different algorithms with a total of 53 tunable parameters. How do we train and cross-validate all these parameters?

USE GRAPHLAB!

[Figures: runtime on 16 cores; speedup plots.]


Who is using GraphLab?

Universities using GraphLab

Companies trying out GraphLab

2400+ unique downloads tracked (possibly many more from direct repository checkouts)

Startups using GraphLab

User community

Performance results

GraphLab vs. Pregel (BSP)

Multicore PageRank (25M Vertices, 355M Edges)

[Figure: histogram of the number of updates per vertex; 51% of vertices are updated only once.]

[Figures: L1 error vs. runtime (s) and L1 error vs. number of updates, for GraphLab and for Pregel (simulated via GraphLab); GraphLab converges faster.]

CoEM (Rosie Jones, 2005)

Named Entity Recognition task: is "Dog" an animal? Is "Catalina" a place?

[Figure: bipartite graph between noun phrases ("the dog", "Australia", "Catalina Island") and contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant").]

Vertices: 2 million. Edges: 200 million.

Hadoop: 95 cores, 7.5 hrs.

[Figure: speedup vs. number of CPUs (up to 16) for GraphLab CoEM, compared against optimal linear speedup; higher is better.]

CoEM (Rosie Jones, 2005)

Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min.

15x faster! 6x fewer CPUs!


GraphLab in the Cloud

CoEM (Rosie Jones, 2005)

[Figure: speedup vs. number of CPUs for the small and large CoEM problems, compared against optimal.]

GraphLab: 16 cores, 30 min. Hadoop: 95 cores, 7.5 hrs.

GraphLab in the Cloud: 32 EC2 machines, 80 secs (0.3% of the Hadoop time).

Cost-Time Tradeoff (video co-segmentation results)

More machines mean higher cost; a few machines help a lot, then returns diminish.

Netflix Collaborative Filtering

Alternating Least Squares matrix factorization. Model: 0.5 million nodes, 99 million edges (users × movies).

[Figure: speedup for Hadoop, MPI, and GraphLab vs. ideal, for D=20 and D=100.]

Multicore Abstraction Comparison

Netflix Matrix Factorization

[Figure: log test error vs. number of updates for dynamic vs. round-robin scheduling.]

Dynamic computation, faster convergence.

The Cost of Hadoop


Fault Tolerance

Larger problems mean an increased chance of machine failure.

GraphLab2 introduces two fault-tolerance (checkpointing) mechanisms: synchronous snapshots and Chandy-Lamport asynchronous snapshots.

Synchronous Snapshots

[Figure: over time, "Run GraphLab" phases alternate with "barrier + snapshot" phases on every machine.]

[Figure: curse of the slow machine; progress with sync. snapshot vs. no snapshot.]

Curse of the Slow Machine

[Figure: a slow machine delays the barrier + snapshot for everyone; progress with no snapshot vs. sync. snapshot vs. delayed sync. snapshot.]

Asynchronous Snapshots

struct chandy_lamport {
  void operator()(icontext_type& context) {
    save(context.vertex_data());
    foreach (edge_type edge, context.in_edges()) {
      if (edge.source() was not marked as saved) {
        save(context.edge_data(edge));
        context.schedule(edge.source(), chandy_lamport());
      }
    }
    ... // Repeat for context.out_edges
    Mark context.vertex() as saved;
  }
};

The Chandy-Lamport algorithm is implementable as a GraphLab update function! Requires edge consistency.

Snapshot Performance

[Figure: progress with no snapshot, async. snapshot, and sync. snapshot.]

Snapshot with 15s fault injection: halt 1 out of 16 machines for 15s.

[Figure: progress with no snapshot, async. snapshot, and sync. snapshot under the injected fault.]

New challenges

Natural Graphs Obey a Power Law

Top 1% of vertices is adjacent to 53% of the edges!

Yahoo! Web Graph: 1.4B vertices, 6.7B edges. The degree distribution follows a "power law"; see the sketch below.
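For reference, a "power law" degree distribution means the probability of seeing a vertex of degree d falls off polynomially (a standard model; the exponent value is a commonly cited assumption, not a measurement from this graph):

P(\text{degree} = d) \propto d^{-\alpha}, \qquad \alpha \approx 2 \text{ for many natural graphs.}

Such heavy tails are exactly why a tiny fraction of very high-degree vertices can touch over half the edges.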

Problem: High Degree Vertices

High degree vertices limit parallelism: they touch a large amount of state, require heavy locking, and are processed sequentially.

High Communication in Distributed Updates

[Figure: a high-degree vertex Y with neighbors spread over Machine 1 and Machine 2.] Data from neighbors is transmitted separately across the network.

High Degree Vertices are Common

Netflix (users × movies): popular movies. Social networks: "social" people. LDA (docs × words, with shared hyperparameters θ and topic assignments z, w): common words such as "Obama".

Two Core Changes to the Abstraction: Factorized Update Functors and Delta Update Functors

Decomposable Update Functors

Monolithic updates become decomposed updates: Gather, Apply, Scatter.

Monolithic updates are replaced by composable update "messages": f1 and f2 compose as (f1 ∘ f2)().

Gather (user defined): a parallel sum over the neighborhood of the scope, Δ = Δ1 + Δ2 + Δ3 + ...

Apply (user defined): apply the accumulated value Δ to the center vertex.

Scatter (user defined): update adjacent edges and vertices.

Relaxed consistency: locks are acquired only for the region within a scope.

Factorized PageRank

double gather(scope, edge) {
  return edge.source().value().rank /
         scope.num_out_edge(edge.source())
}

double merge(acc1, acc2) { return acc1 + acc2 }

void apply(scope, accum) {
  old_value = scope.center_value().rank
  scope.center_value().rank = ALPHA + (1 - ALPHA) * accum
  scope.center_value().residual =
      abs(scope.center_value().rank - old_value)
}

void scatter(scope, edge) {
  if (scope.center_vertex().residual > EPSILON)
    reschedule(edge.target())
}

Factorized Updates: Significant Decrease in Communication

Split gather and scatter across machines: each machine runs F1 or F2 locally and composes them as (F1 ∘ F2)(Y); only a small amount of data is transmitted over the network.

Factorized Consistency

Neighboring vertices may be updated simultaneously: A and B can gather at the same time.

Factorized Consistency Locking

A gather on an edge cannot occur during an apply: vertex B gathers on its other neighbors while A is performing its apply.

Decomposable Loopy Belief Propagation

Gather: accumulates the product of incoming messages.

Apply: updates the central belief.

Scatter: computes outgoing messages and schedules adjacent vertices. (See the reference equations below.)
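For reference, the quantities these three phases compute are the standard loopy BP updates (standard notation, not copied from the slides):

b_i(x_i) \propto \phi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i)
\qquad \text{(gather the product of in-messages; apply the new belief)}

m_{i \to j}(x_j) \propto \sum_{x_i} \psi_{ij}(x_i, x_j)\, \phi_i(x_i) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i)
\qquad \text{(scatter: out-message to neighbor } j\text{)}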

Decomposable Alternating Least Squares (ALS)

[Figure: the sparse Netflix ratings matrix (users × movies) is approximated by User Factors (W) × Movie Factors (X).]

Update function: Gather: sum terms (over pairs w_i, x_j). Apply: matrix inversion & multiply.
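Concretely, for a user factor w_i and the movies j that user rated with ratings r_ij, "sum terms" and "matrix inversion & multiply" are the standard regularized ALS solve (standard form, stated for reference; λ is the regularization weight):

A_i = \sum_{j \in N(i)} x_j x_j^\top, \qquad b_i = \sum_{j \in N(i)} r_{ij}\, x_j
\qquad \text{(gather: sum terms)}

w_i = \left(A_i + \lambda I\right)^{-1} b_i
\qquad \text{(apply: matrix inversion and multiply)}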

Comparison of Abstractions

Multicore PageRank (25M Vertices, 355M Edges)

[Figure: L1 error vs. runtime (s) for GraphLab1 and factorized updates.]

Need for Vertex-Level Asynchrony

Exploit the commutative-associative "sum": Y's value is a sum of its neighbors' contributions. With factorized updates, a single changed neighbor still forces a costly full gather.

Delta Updates: Vertex-Level Asynchrony

Keep the old (cached) sum at Y and add only the change, + Δ, for each update, instead of re-gathering everything.

Delta Update

void update(scope, delta) {
  scope.center_value() = scope.center_value() + delta

  if (abs(delta) > EPSILON) {
    out_delta = delta * (1 - ALPHA) *
                1 / scope.num_out_edge(edge.source())
    reschedule_out_neighbors(out_delta)
  }
}

double merge(delta, delta) { return delta + delta }

The program starts with: schedule_all(ALPHA)

Multicore Abstraction Comparison

Multicore PageRank (25M Vertices, 355M Edges)

[Figure: L1 error vs. runtime (s) for delta updates, factorized updates, GraphLab 1, and simulated Pregel.]

Distributed Abstraction Comparison

Distributed PageRank (25M Vertices, 355M Edges)

[Figures: runtime (s) and total communication (GB) vs. number of machines (8 CPUs per machine) for GraphLab1 and GraphLab2 (delta updates).]

PageRank on the Altavista Webgraph 2002 (1.4B vertices, 6.7B edges):

Hadoop: 9000 s, 800 cores

Prototype GraphLab2: 431 s, 512 cores (known inefficiencies; a 2x gain is possible)

Summary of GraphLab2

Decomposed update functions (Gather, Apply, Scatter) expose parallelism in high-degree vertices.

Delta update functions expose asynchrony in high-degree vertices.

Lessons Learned

Machine Learning:

Asynchronous is often much faster than synchronous, and dynamic computation is often faster still. However, it can be difficult to define optimal thresholds: science to do!

Consistency can improve performance and is sometimes required for convergence, though there are cases where relaxed consistency is sufficient.

System:

Distributed asynchronous systems are harder to build, but no distributed barriers means better scalability and performance.

Scaling up by an order of magnitude requires rethinking design assumptions, e.g., the distributed graph representation.

High-degree vertices and natural graphs can limit parallelism; further assumptions on update functions are needed.

Summary

An abstraction tailored to Machine Learning: targets graph-parallel algorithms.

Naturally expresses data/computational dependencies and dynamic iterative computation.

Simplifies parallel algorithm design; automatically ensures data consistency; achieves state-of-the-art parallel performance on a variety of problems.


Parallel GraphLab 1.1: Multicore, available today.

GraphLab2 (in the Cloud): soon…

http://graphlab.org: Documentation… Code… Tutorials…
