Carnegie Mellon University

Parallel Machine Learning for Large-Scale Natural Graphs

Carlos Guestrin

The GraphLab Team:
Yucheng Low
Joseph Gonzalez
Aapo Kyrola
Danny Bickson
Joe Hellerstein
Jay Gu
Alex Smola
Parallelism is Difficult
Wide array of different parallel architectures:
GPUs, Multicore, Clusters, Clouds, Supercomputers
Different challenges for each architecture.

High-level abstractions to make things easier.

How will we design and implement parallel learning systems?
... a popular answer:
Map-Reduce / Hadoop
Build learning algorithms on top of high-level parallel abstractions

Belief Propagation
Label Propagation
Kernel Methods
Deep Belief Networks
Neural Networks
Tensor Factorization
PageRank
Lasso
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!

Data-Parallel (Map Reduce):
Feature Extraction
Cross Validation
Computing Sufficient Statistics

Graph-Parallel: ?
Example of Graph Parallelism
PageRank Example
Iterate:
R[i] = α + (1 - α) * Σ_{j ∈ N[i]} R[j] / L[j]
Where:
α is the random reset probability
L[j] is the number of links on page j
[Figure: small example web graph over pages 1 to 6]
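For concreteness, here is a minimal sequential sketch of this iteration in plain C++ (the adjacency-list layout and names are illustrative assumptions, not the GraphLab API):

#include <cstddef>
#include <vector>

// One sweep of the iteration R[i] = alpha + (1 - alpha) * sum_{j -> i} R[j] / L[j],
// where L[j] is the number of links on page j.
std::vector<double> pagerank_sweep(
    const std::vector<std::vector<int>>& in_links,  // in_links[i]: pages that link to i
    const std::vector<int>& num_links,              // L[j]
    const std::vector<double>& rank,                // current R
    double alpha) {
  std::vector<double> next(rank.size());
  for (std::size_t i = 0; i < rank.size(); ++i) {
    double sum = 0.0;
    for (int j : in_links[i])
      sum += rank[j] / num_links[j];
    next[i] = alpha + (1.0 - alpha) * sum;
  }
  return next;
}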
Properties of Graph Parallel Algorithms
Dependency Graph
Iterative Computation: My Rank depends on my Friends' Ranks
Local Updates
Belief Propagation
SVM
Kernel Methods
Deep Belief Networks
Neural Networks
Tensor Factorization
PageRank
Lasso
Addressing Graph-Parallel ML
We need alternatives to Map-Reduce.

Data-Parallel (Map Reduce):
Feature Extraction
Cross Validation
Computing Sufficient Statistics

Graph-Parallel: Map Reduce? Pregel (Giraph)?
Pregel (Giraph)
Bulk Synchronous Parallel Model:
Compute, Barrier, Communicate

Problem:
Bulk synchronous computation can be highly inefficient.

BSP Systems Problem: Curse of the Slow Job
[Diagram: bulk synchronous execution: data is partitioned across CPU 1, CPU 2, and CPU 3, and iterations of compute are separated by barriers]
Bulk synchronous computation model provably inefficient for some ML tasks
BSP ML Problem: Data-Parallel Algorithms can be Inefficient
Limitations of bulk synchronous model can lead to provably inefficient parallel algorithms
[Plot: runtime in seconds (0 to 9000) vs. number of CPUs (1 to 8): Bulk Synchronous (Pregel) vs. Asynchronous Splash BP]
But distributed Splash BP was built from scratch… an efficient, parallel implementation was painful, painful, painful to achieve.
Belief Propagation
SVM
Kernel Methods
Deep Belief Networks
Neural Networks
Tensor Factorization
PageRank
Lasso
The Need for a New Abstraction
If not Pregel, then what?

Data-Parallel (Map Reduce):
Feature Extraction
Cross Validation
Computing Sufficient Statistics

Graph-Parallel: Pregel (Giraph)?
The GraphLab Solution
Designed specifically for ML needs:
Express data dependencies
Iterative computation

Simplifies the design of parallel programs:
Abstract away hardware issues
Automatic data synchronization
Addresses multiple hardware architectures:
Multicore, Distributed, Cloud computing (GPU implementation in progress)
What is GraphLab?
The GraphLab Framework
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.

Graph: • Social network
Vertex Data: • User profile text • Current interests estimates
Edge Data: • Similarity weights
pagerank(i, scope) {
  // Get neighborhood data (R[i], w_ji, R[j]) from scope
  // Update the vertex data:
  //   R[i] = α + (1 - α) * Σ_{j ∈ N[i]} w_ji * R[j]
  // Reschedule neighbors if needed:
  if R[i] changes then reschedule_neighbors_of(i);
}
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.
Dynamic computation
The Scheduler
The scheduler determines the order in which vertices are updated.
[Diagram: a queue of vertices (a, b, c, ..., k) is drained by CPU 1 and CPU 2; update functions may place new vertices back onto the scheduler]
The process repeats until the scheduler is empty
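Conceptually, the scheduler is just a work queue of vertices whose update functions may enqueue more work. A minimal single-threaded sketch of that loop (illustrative types and names, not GraphLab's actual scheduler classes):

#include <functional>
#include <queue>

using vertex_id = int;

// Drain the scheduler: run the update function on each scheduled vertex.
// The update may push (reschedule) further vertices, e.g. neighbors whose
// values changed, so the loop runs until no work remains.
void run_scheduler(std::queue<vertex_id>& scheduler,
                   const std::function<void(vertex_id, std::queue<vertex_id>&)>& update) {
  while (!scheduler.empty()) {
    vertex_id v = scheduler.front();
    scheduler.pop();
    update(v, scheduler);
  }
}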
The GraphLab Framework
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
Ensuring Race-Free Code
How much can computation overlap?
Need for Consistency?
No consistency → higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm.

[Plot: Train RMSE vs. number of updates for dynamic ALS on Netflix data (8 cores): consistent vs. inconsistent execution]
Even Simple PageRank can be Dangerous
GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  ...
Inconsistent PageRank
Even Simple PageRank can be Dangerous
GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  ...
[Diagram: CPU 1 reads a vertex value while CPU 2 is writing it]
Read-write race: CPU 1 reads a bad PageRank estimate as CPU 2 computes the value.
Race Condition Can Be Very Subtle

Unstable:
GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  ...

Stable:
GraphLab_pagerank(scope) {
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  scope.center_value = sum
  ...
This was actually encountered in user code.
GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution of update functions which produces the same result.
[Diagram: a parallel execution across CPU 1 and CPU 2 over time, alongside an equivalent sequential execution on a single CPU]
Consistency Rules
Guaranteed sequential consistency for all update functions
Full Consistency
Obtaining More Parallelism
Edge Consistency
[Diagram: under edge consistency, CPU 1 and CPU 2 safely update non-adjacent vertices in parallel; overlapping reads of shared neighbor data are safe]
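One simple way to picture how edge consistency can be enforced: before an update runs on a vertex, take locks on that vertex and its neighbors in a canonical order so that overlapping scopes never deadlock. A rough sketch under that assumption (plain mutexes instead of the reader/writer locks a real implementation would use; not GraphLab's actual locking code):

#include <algorithm>
#include <mutex>
#include <vector>

// One lock per vertex; the scope of vertex v is v plus its neighbors.
// Acquiring locks in sorted id order avoids deadlock when two
// neighboring scopes are locked concurrently.
void lock_scope(std::vector<std::mutex>& locks, int v, std::vector<int> neighbors) {
  neighbors.push_back(v);
  std::sort(neighbors.begin(), neighbors.end());
  for (int u : neighbors) locks[u].lock();
}

void unlock_scope(std::vector<std::mutex>& locks, int v, std::vector<int> neighbors) {
  neighbors.push_back(v);
  for (int u : neighbors) locks[u].unlock();
}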
The GraphLab Framework
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
Bayesian Tensor Factorization
Gibbs Sampling
Dynamic Block Gibbs Sampling
Matrix Factorization
Lasso
SVM
Belief Propagation
PageRank
CoEM
K-Means
SVD
LDA
Linear Solvers
Splash Sampler
Alternating Least Squares
…Many others…
GraphLab vs. Pregel (BSP)
Multicore PageRank (25M Vertices, 355M Edges)
[Plot: histogram of the number of updates per vertex: 51% of vertices were updated only once]

[Plots: L1 error vs. runtime (s) and L1 error vs. number of updates: GraphLab vs. Pregel (simulated via GraphLab)]
CoEM (Rosie Jones, 2005)
Named Entity Recognition Task
Is "Dog" an animal? Is "Catalina" a place?

Noun phrases: the dog, Australia, Catalina Island
Contexts: <X> ran quickly, travelled to <X>, <X> is pleasant

Vertices: 2 Million
Edges: 200 Million

Hadoop: 95 Cores, 7.5 hrs
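The CoEM vertex update has the same flavor as the PageRank update shown earlier: a noun-phrase vertex re-estimates its label belief as a co-occurrence-weighted average over its context neighbors, and context vertices do the symmetric thing. A hedged sketch of that per-vertex update with illustrative types, not the actual GraphLab CoEM code:

#include <vector>

struct edge { int neighbor; double weight; };  // co-occurrence count

// Weighted average of the neighbors' current estimates,
// normalized by the total edge weight into this vertex.
double coem_update(const std::vector<edge>& edges,
                   const std::vector<double>& neighbor_estimate) {
  double weighted_sum = 0.0, total_weight = 0.0;
  for (const edge& e : edges) {
    weighted_sum += e.weight * neighbor_estimate[e.neighbor];
    total_weight += e.weight;
  }
  return total_weight > 0.0 ? weighted_sum / total_weight : 0.0;
}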
[Plot: speedup vs. number of CPUs (up to 16) for GraphLab CoEM, compared against optimal speedup; higher is better]
CoEM (Rosie Jones, 2005)
Hadoop: 95 Cores, 7.5 hrs
GraphLab: 16 Cores, 30 min
15x Faster! 6x fewer CPUs!
Carnegie Mellon
GraphLab in the Cloud
CoEM (Rosie Jones, 2005)
[Plot: speedup vs. number of CPUs for the small and large CoEM problems, compared against optimal speedup]

Hadoop: 95 Cores, 7.5 hrs
GraphLab: 16 Cores, 30 min
GraphLab in the Cloud: 32 EC2 machines, 80 secs (0.3% of the Hadoop time)
Video Cosegmentation
Segments mean the same
Model: 10.5 million nodes, 31 million edges
Gaussian EM clustering + BP on 3D grid
Video Coseg. Speedups
[Plots: video co-segmentation speedups: GraphLab vs. ideal speedup]
Cost-Time Tradeoff (video co-segmentation results)
More machines: higher cost, faster runtime
A few machines help a lot
Diminishing returns
Netflix Collaborative Filtering
Alternating Least Squares Matrix Factorization
Model: 0.5 million nodes, 99 million edges
[Diagram: bipartite Netflix graph of Users and Movies]
[Plot: Hadoop, MPI, and GraphLab runtimes vs. ideal, for D=20 and D=100]
Multicore Abstraction Comparison
Netflix Matrix Factorization
[Plot: log test error vs. number of updates: Dynamic vs. Round Robin scheduling]

Dynamic Computation, Faster Convergence
The Cost of Hadoop
Carnegie Mellon University
Fault Tolerance
Fault-Tolerance
Larger Problems → Increased chance of Machine Failure
GraphLab2 Introduces two fault tolerance (checkpointing) mechanisms
Synchronous Snapshots
Chandy-Lamport Asynchronous Snapshots
Synchronous Snapshots
[Diagram: over time, execution alternates "Run GraphLab" phases with "Barrier + Snapshot" steps]
[Plot: progress over time, no snapshot vs. synchronous snapshot, illustrating the curse of the slow machine]
Curse of the Slow Machine
[Diagram: a slow machine delays the barrier + snapshot for all machines]
[Plot: progress over time, no snapshot vs. synchronous snapshot vs. delayed synchronous snapshot]
Asynchronous Snapshots
struct chandy_lamport {
  void operator()(icontext_type& context) {
    save(context.vertex_data());
    foreach (edge_type edge, context.in_edges()) {
      if (edge.source() was not marked as saved) {
        save(context.edge_data(edge));
        context.schedule(edge.source(), chandy_lamport());
      }
    }
    ... // Repeat for context.out_edges
    Mark context.vertex() as saved;
  }
};
The Chandy-Lamport algorithm is implementable as a GraphLab update function! Requires edge consistency.
Snapshot Performance
[Plot: progress over time: no snapshot vs. synchronous snapshot vs. asynchronous snapshot]
Snapshot with 15s fault injection
[Plot: progress with a 15 s fault injection (halt 1 out of 16 machines for 15 s): no snapshot vs. synchronous snapshot vs. asynchronous snapshot]
Why do we need to update the GraphLab Abstraction?
Natural Graphs
Natural Graphs Power Law
Top 1% of vertices is adjacent to 53% of the edges!
Yahoo! Web Graph: 1.4B Verts, 6.7B Edges
“Power Law”
Problem: High Degree Vertices
High degree vertices limit parallelism:
Touch a Large Amount of State
Requires Heavy Locking
Processed Sequentially
Split gather and scatter across machines:
High Communication in Distributed Updates
[Diagram: a high-degree vertex Y with its neighbors split across Machine 1 and Machine 2]
Data from neighbors is transmitted separately across the network.
High Degree Vertices are Common
Netflix (Users, Movies): "social" people and popular movies
LDA (Docs, Words; hyperparameters α, β): common words
Social graphs: celebrities (e.g., Obama)
[Figure: the Netflix Users-Movies bipartite graph and the LDA plate diagram (θ, Z, w)]
Two Core Changes to Abstraction

Factorized Update Functors:
Monolithic Updates → Decomposed Updates (Gather, Apply, Scatter)

Delta Update Functors:
Monolithic Updates → Composable Update "Messages": f1, f2, (f1 ∘ f2)( )
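The composable-update idea: if functors f1 and f2 are both pending for the same vertex, the scheduler can merge them into a single (f1 ∘ f2) instead of storing and running both. A minimal sketch of what such a functor could look like (illustrative names, not the actual GraphLab2 interface):

// A PageRank-style update functor that carries an accumulated delta.
// Two pending functors for the same vertex compose by adding their deltas,
// so the scheduler only ever stores one functor per vertex.
struct delta_functor {
  double delta = 0.0;

  // Compose: the combined effect of (f1 o f2) on the same target vertex.
  delta_functor& operator+=(const delta_functor& other) {
    delta += other.delta;
    return *this;
  }

  // Applied to the vertex data when the scheduler runs the functor.
  template <typename VertexData>
  void operator()(VertexData& vertex) const {
    vertex.rank += delta;
  }
};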
Decomposable Update Functors
Locks are acquired only for the region within a scope → Relaxed Consistency

User-defined Gather: Gather(Y, edge) → Δ, combined as a parallel sum over the scope (Δ1 + Δ2 → Δ3)
User-defined Apply: Apply(Y, Δ) → Y, applying the accumulated value to the center vertex
User-defined Scatter: Scatter(Y, edge), updating adjacent edges and vertices
Factorized PageRank

double gather(scope, edge) {
  return edge.source().value().rank /
         scope.num_out_edge(edge.source())
}

double merge(acc1, acc2) { return acc1 + acc2 }

void apply(scope, accum) {
  old_value = scope.center_value().rank
  scope.center_value().rank = ALPHA + (1 - ALPHA) * accum
  scope.center_value().residual =
      abs(scope.center_value().rank - old_value)
}

void scatter(scope, edge) {
  if (scope.center_value().residual > EPSILON)
    reschedule(edge.target())
}
Split gather and scatter across machines:
Factorized Updates: Significant Decrease in Communication
[Diagram: (F1 ∘ F2)(Y): partial gathers run on each machine and only the partial results cross the network]
Only a small amount of data is transmitted over the network.
Factorized Consistency
Neighboring vertices may be updated simultaneously:
[Diagram: vertices A and B both in the Gather phase, then Apply]
Factorized Consistency Locking
Gather on an edge cannot occur during apply:
[Diagram: vertices A and B]
Vertex B gathers on its other neighbors while A is performing Apply.
Decomposable Loopy Belief Propagation
Gather: Accumulates product of in messages
Apply: Updates central belief
Scatter: Computes out messages and schedules adjacent vertices
Decomposable Alternating Least Squares (ALS)
[Diagram: Netflix ratings graph: user vertices (y1…y4) with user factors (W: w1, w2) connected to movie vertices with movie factors (X: x1, x2, x3); the Users x Movies ratings matrix is approximated by W x X]
Update Function:
Gather: sum the terms involving w_i and x_j
Apply: matrix inversion & multiply
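In the apply step, the gathered accumulators are turned into the new factor for the center vertex by solving a small d x d regularized least-squares system, w_i = (sum_j x_j x_j^T + lambda I)^(-1) (sum_j r_ij x_j). A hedged sketch of that step using Eigen (the library choice, names, and the regularization term are illustrative assumptions, not the GraphLab ALS code):

#include <Eigen/Dense>

// ALS apply for one user vertex: XtX and Xty are the accumulators
// produced by the gather/merge phase over the adjacent movie vertices.
Eigen::VectorXd als_apply(const Eigen::MatrixXd& XtX,
                          const Eigen::VectorXd& Xty,
                          double lambda) {
  const int d = static_cast<int>(Xty.size());
  Eigen::MatrixXd A = XtX + lambda * Eigen::MatrixXd::Identity(d, d);
  return A.ldlt().solve(Xty);  // solve the system rather than invert explicitly
}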
Comparison of Abstractions
Multicore PageRank (25M Vertices, 355M Edges)
[Plot: L1 error vs. runtime (s): GraphLab1 vs. Factorized Updates]
Need for Vertex Level Asynchrony
Exploit the commutative, associative "sum":
[Diagram: Y is the sum of all neighbor contributions]
Costly gather for a single change!

Commut-Assoc Vertex Level Asynchrony
Exploit the commutative, associative "sum":
[Diagram: a single changed contribution can be folded in as a delta Δ]

Delta Updates: Vertex Level Asynchrony
Exploit the commutative, associative "sum":
[Diagram: Y keeps an old (cached) sum; a change arrives as a delta Δ and is added to it]
Delta Update
void update(scope, delta) {
  scope.center_value() = scope.center_value() + delta
  if (abs(delta) > EPSILON) {
    out_delta = delta * (1 - ALPHA) * 1 / scope.num_out_edges()
    reschedule_out_neighbors(out_delta)
  }
}

double merge(delta, delta) { return delta + delta }

Program starts with: schedule_all(ALPHA)
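Outside of the GraphLab API, this delta idea is the classic push-style PageRank: each vertex keeps a cached rank plus a pending delta, and forwards alpha-scaled deltas to its out-neighbors until every delta drops below a tolerance. A minimal sequential sketch (data layout and names are illustrative, not GraphLab code):

#include <cmath>
#include <queue>
#include <vector>

// Push-style (delta) PageRank: start every vertex with delta = alpha
// (the analogue of schedule_all(ALPHA)) and push (1 - alpha) * delta / out_degree
// to the out-neighbors until all deltas fall below eps.
std::vector<double> delta_pagerank(const std::vector<std::vector<int>>& out_links,
                                   double alpha, double eps) {
  const int n = static_cast<int>(out_links.size());
  std::vector<double> rank(n, 0.0), delta(n, alpha);
  std::queue<int> work;
  for (int v = 0; v < n; ++v) work.push(v);

  while (!work.empty()) {
    int v = work.front(); work.pop();
    double d = delta[v];
    if (std::fabs(d) < eps) continue;
    delta[v] = 0.0;
    rank[v] += d;                                   // fold the cached delta into the rank
    if (out_links[v].empty()) continue;
    double push = (1.0 - alpha) * d / out_links[v].size();
    for (int u : out_links[v]) {
      bool was_below = std::fabs(delta[u]) < eps;
      delta[u] += push;
      if (was_below && std::fabs(delta[u]) >= eps)  // reschedule neighbor on threshold crossing
        work.push(u);
    }
  }
  return rank;
}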
Multicore Abstraction Comparison
Multicore PageRank (25M Vertices, 355M Edges)
[Plot: L1 error vs. runtime (s): Delta, Factorized, GraphLab 1, and Simulated Pregel]
Distributed Abstraction Comparison
Distributed PageRank (25M Vertices, 355M Edges)
[Plots: runtime (s) and total communication (GB) vs. number of machines (8 CPUs per machine): GraphLab1 vs. GraphLab2 (Delta Updates)]
PageRank: Altavista Webgraph 2002
1.4B vertices, 6.7B edges
Hadoop: 9000 s, 800 cores
Prototype GraphLab2: 431 s, 512 cores
Known inefficiencies: 2x gain possible
Summary of GraphLab2
Decomposed Update Functions: expose parallelism in high-degree vertices (Gather, Apply, Scatter)
Delta Update Functions: expose asynchrony in high-degree vertices
Lessons Learned
Machine Learning:
Asynchronous is often much faster than synchronous
Dynamic computation is often faster
However, it can be difficult to define optimal thresholds: science to do!
Consistency can improve performance
Sometimes required for convergence, though there are cases where relaxed consistency is sufficient
System:
Distributed asynchronous systems are harder to build
But no distributed barriers == better scalability and performance
Scaling up by an order of magnitude requires rethinking of design assumptions
E.g., distributed graph representation
High-degree vertices & natural graphs can limit parallelism
Need further assumptions on update functions
Startups Using GraphLab
2000+ Unique Downloads Tracked (possibly many more from direct repository checkouts)
Companies experimenting (or downloading) with GraphLab
Academic projects exploring (or downloading) GraphLab
Yucheng
GraphLab Matrix Factorization Library
Used in ACM KDD Cup 2011, track 1: 5th place out of more than 1000 participants [Wu et al.]
2 orders of magnitude faster than Mahout
Blended 12 matrix factorization algorithms
Summary
An abstraction tailored to Machine Learning
Targets Graph-Parallel Algorithms
Naturally expresses:
Data / computational dependencies
Dynamic, iterative computation
Simplifies parallel algorithm design
Automatically ensures data consistency
Achieves state-of-the-art parallel performance on a variety of problems
Carnegie Mellon
Parallel GraphLab 1.1 (Multicore): available today
GraphLab2 (in the Cloud): soon…
http://graphlab.org
Documentation… Code… Tutorials…
Next slide is an extra slide in case people ask about running the distributed asynchronous delta (which is basically asynchronous message passing) in a synchronous fashion (i.e., using deltas in a Pregel implementation).
Distributed Abstraction Comparison
Distributed PageRank (25M Vertices, 355M Edges)
[Plots: runtime (s) and total communication (GB) vs. number of machines (8 CPUs per machine): GL 1 (Chromatic), GL 2 Delta (Synchronous), GL 2 Delta (Asynchronous)]
Update Count Distribution
[Plot: number of vertices vs. number of updates]
Most vertices need to be updated infrequently