

Carnegie Mellon

A Framework for Machine Learning and Data Mining in the Cloud

Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joe Hellerstein

Big Data is Everywhere

72 Hours of Video Uploaded to YouTube Every Minute
28 Million Wikipedia Pages
900 Million Facebook Users
6 Billion Flickr Photos

“… data a new class of economic asset, like currency or gold.”

“…growing at 50 percent a year…”

How will we design and implement Big Learning systems?

Big Learning

Shift Towards Use Of Parallelism in ML

GPUs Multicore Clusters Clouds Supercomputers

ML experts repeatedly solve the same parallel design challenges:

Race conditions, distributed state, communication…

Resulting code is very specialized: difficult to maintain, extend, debug…

Graduate students

Avoid these problems by using high-level abstractions.

MapReduce – Map Phase

CPU 1 | CPU 2 | CPU 3 | CPU 4
Embarrassingly parallel, independent computation. No communication needed.
[Figure: each CPU independently maps its portion of the data to partial results.]

MapReduce – Reduce Phase

CPU 1 | CPU 2
[Figure: the per-CPU partial results are aggregated. Example from the slides: Image Features from faces labeled Attractive (A) or Ugly (U) are reduced into Attractive Face Statistics and Ugly Face Statistics.]
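To make the map/reduce pattern above concrete, here is a minimal plain-C++ sketch (not Hadoop or GraphLab code); the Image struct, extract_feature, and the sample values are illustrative assumptions. The map step is independent per record; the reduce step combines per-label partial results.

// Minimal sketch of the map/reduce pattern (plain C++, illustrative only).
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Image { std::string label; double raw; };           // "A" (attractive) or "U" (ugly)

double extract_feature(const Image& img) {                 // map: independent per image
  return img.raw * 0.5;                                    // placeholder feature computation
}

int main() {
  std::vector<Image> images = {{"A", 12.9}, {"U", 42.3}, {"A", 21.3}, {"U", 25.8}};

  // Map phase: embarrassingly parallel, no communication needed.
  std::vector<std::pair<std::string, double>> mapped;
  for (const auto& img : images)
    mapped.emplace_back(img.label, extract_feature(img));

  // Reduce phase: combine partial results per label.
  std::map<std::string, std::pair<double, int>> acc;       // label -> (sum, count)
  for (const auto& [label, value] : mapped) {
    acc[label].first += value;
    acc[label].second += 1;
  }
  for (const auto& [label, sc] : acc)
    std::cout << label << " mean feature: " << sc.first / sc.second << "\n";
}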

MapReduce for Data-Parallel ML

Excellent for large data-parallel tasks!

Data-Parallel (MapReduce):
Cross Validation, Feature Extraction, Computing Sufficient Statistics

Graph-Parallel:
Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.), Semi-Supervised Learning (Label Propagation, CoEM), Graph Analysis (PageRank, Triangle Counting), Collaborative Filtering (Tensor Factorization)

Is there more to Machine Learning?


Exploit Dependencies

[Figure: a social graph whose users are labeled Hockey or Scuba Diving; a user connected to both groups is inferred to be interested in Underwater Hockey.]

Graphs are Everywhere

Collaborative Filtering: Users – Movies (Netflix)
Text Analysis: Docs – Words (Wiki)
Probabilistic Analysis: Social Network

Properties of Computation on Graphs

Dependency Graph

Local Updates: my interests depend on my friends' interests

Iterative Computation

ML Tasks Beyond Data-Parallelism

Data-Parallel (MapReduce):
Cross Validation, Feature Extraction, Computing Sufficient Statistics

Graph-Parallel:
Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.), Semi-Supervised Learning (Label Propagation, CoEM), Graph Analysis (PageRank, Triangle Counting), Collaborative Filtering (Tensor Factorization)

Bayesian Tensor Factorization, Gibbs Sampling, Matrix Factorization, Lasso, SVM, Belief Propagation, PageRank, CoEM, SVD, LDA, Splash Sampler, Alternating Least Squares, Linear Solvers, …many others…

2010: Shared Memory
Limited CPU power, limited memory, limited scalability.

Distributed Cloud
- Distributing state
- Data consistency
- Fault tolerance

Unlimited amount of computation resources! (up to funding limitations)

The GraphLab Framework

Graph-Based Data Representation
Update Functions (User Computation)
Consistency Model

Data Graph
Data associated with vertices and edges.

Vertex Data:
• User profile
• Current interest estimates

Edge Data:
• Relationship (friend, classmate, relative)

Graph:
• Social Network
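A rough sketch of what the data graph looks like in code, assuming illustrative C++ types; the names here (VertexData, EdgeData, DataGraph) are assumptions for the example, not the actual GraphLab API.

// Illustrative sketch of the data-graph idea: user-defined data attached to
// every vertex and every edge of a graph.
#include <string>
#include <vector>

struct VertexData {                 // data stored on each vertex
  std::string profile;              // user profile
  std::vector<double> interests;    // current interest estimates
};

struct EdgeData {                   // data stored on each edge
  std::string relationship;         // "friend", "classmate", "relative", ...
};

struct Edge { int source, target; EdgeData data; };

struct DataGraph {                  // the social network itself
  std::vector<VertexData> vertices;
  std::vector<Edge> edges;
};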

Distributed Graph

Partition the graph across multiple machines.

Ghost vertices maintain adjacency structure and replicate remote data.

Cut efficiently using HPC graph partitioning tools (ParMetis / Scotch / …).
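One way to picture a machine's slice of the distributed graph is the hedged C++ sketch below; LocalPartition and its members are assumptions for illustration, not GraphLab's internal representation.

// Rough sketch of one machine's slice of a partitioned graph with ghosts.
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct LocalPartition {
  // Vertices this machine owns (authoritative copies).
  std::unordered_set<int> owned_vertices;
  // Ghost vertices: owned elsewhere, replicated here so local edges stay intact.
  std::unordered_map<int, int> ghost_owner;                 // ghost vertex id -> owning machine
  // Adjacency restricted to edges incident to owned vertices.
  std::unordered_map<int, std::vector<int>> adjacency;

  // When a remote owner changes a vertex, its ghost copy here must be refreshed.
  bool is_ghost(int v) const { return ghost_owner.count(v) > 0; }
};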

The GraphLab Framework

Graph-Based Data Representation
Update Functions (User Computation)
Consistency Model

Update Functions
User-defined program: applied to a vertex; transforms the data in the scope of that vertex.

Pagerank(scope) {
  // Update the current vertex data
  // Reschedule neighbors if needed
  if vertex.PageRank changes then reschedule_all_neighbors;
}

Dynamic computation: the update function is applied (asynchronously) in parallel until convergence. Many schedulers are available to prioritize computation.
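As a sketch of what such an update function could look like in C++ (the Scope interface, edge weights, damping factor, and tolerance are assumptions for illustration; this is not the actual GraphLab API):

// Sketch of a PageRank-style update function in the spirit of the slide.
#include <cmath>
#include <utility>
#include <vector>

struct Vertex { double pagerank = 1.0; };

struct Scope {                                      // data visible to one update
  Vertex* center;
  std::vector<std::pair<const Vertex*, double>> in_neighbors;  // (neighbor, edge weight)
  std::vector<int> out_neighbor_ids;                // who to reschedule on change
};

constexpr double kDamping = 0.85;                   // assumed damping factor
constexpr double kTolerance = 1e-4;                 // assumed convergence threshold

// Returns the neighbors to reschedule (empty if the value barely changed).
std::vector<int> pagerank_update(Scope& scope) {
  double sum = 0.0;
  for (const auto& [nbr, weight] : scope.in_neighbors)
    sum += weight * nbr->pagerank;                  // weight is typically 1 / out-degree

  const double old_value = scope.center->pagerank;
  scope.center->pagerank = (1.0 - kDamping) + kDamping * sum;

  // Dynamic computation: only reschedule neighbors when this vertex moved enough.
  if (std::fabs(scope.center->pagerank - old_value) > kTolerance)
    return scope.out_neighbor_ids;
  return {};
}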

Why Dynamic?

Asynchronous Belief Propagation
[Figure: cumulative vertex updates on a graphical model; many updates concentrate in some regions, few in others. The algorithm identifies and focuses on hidden sequential structure. Challenge = boundaries.]

Shared Memory Dynamic Schedule

[Figure: a shared scheduler holds pending vertices (a, b, h, i, …); CPU 1 and CPU 2 repeatedly pull vertices from the scheduler, run the update function, and add rescheduled neighbors back. The process repeats until the scheduler is empty.]

Distributed Scheduling

[Figure: the graph is split across two machines; each keeps its own queue of pending local vertices.]

Each machine maintains a schedule over the vertices it owns. Distributed consensus is used to identify completion.
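A minimal sketch of one machine's scheduling loop, assuming the update function and ownership set are supplied by the caller; message passing and the distributed-consensus termination check are omitted.

// One machine's dynamic schedule over the vertices it owns (illustrative).
#include <deque>
#include <functional>
#include <iostream>
#include <unordered_set>
#include <vector>

void run_local_schedule(std::deque<int> schedule,
                        const std::unordered_set<int>& owned,
                        const std::function<std::vector<int>(int)>& update) {
  while (!schedule.empty()) {            // a real distributed engine also needs
    int v = schedule.front();            // consensus to detect global completion
    schedule.pop_front();
    for (int u : update(v)) {
      if (owned.count(u)) schedule.push_back(u);
      // else: forward u to the machine that owns it (omitted)
    }
  }
}

int main() {
  std::unordered_set<int> owned = {0, 1, 2, 3};
  // Toy update: vertex v reschedules v + 1 once, then stops.
  auto update = [](int v) { return v < 3 ? std::vector<int>{v + 1} : std::vector<int>{}; };
  run_local_schedule({0}, owned, update);
  std::cout << "schedule drained\n";
}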

Ensuring Race-Free Code
How much can computation overlap?

The GraphLab Framework

Graph-Based Data Representation
Update Functions (User Computation)
Consistency Model

PageRank Revisited

Pagerank(scope) {
  …
}

Racing PageRank: this was actually encountered in user code.

Bugs
[Slides step through the Pagerank(scope) body, showing how concurrent updates race on shared vertex data; a later variant accumulates into a temporary (tmp) value.]
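A small hedged illustration of the kind of race involved (not the slide's elided code): accumulating directly into shared vertex data exposes a half-built sum to concurrent readers, whereas computing into a tmp and publishing once does not.

// Hedged illustration only; full correctness still needs the consistency model.
#include <vector>

struct Vertex { double pagerank = 1.0; };

// Racy style: neighbors running in parallel may read v.pagerank mid-accumulation.
void racy_update(Vertex& v, const std::vector<const Vertex*>& in_nbrs) {
  v.pagerank = 0.15;                          // partially-updated value is now visible
  for (const Vertex* n : in_nbrs)
    v.pagerank += 0.85 * n->pagerank;         // the visible sum grows step by step
}

// Safer style: build the new value in a temporary, publish it once.
void tmp_update(Vertex& v, const std::vector<const Vertex*>& in_nbrs) {
  double tmp = 0.15;
  for (const Vertex* n : in_nbrs)
    tmp += 0.85 * n->pagerank;
  v.pagerank = tmp;                           // single write of the finished value
}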

Throughput != Performance

No consistency → higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm.

Racing Collaborative Filtering

[Plot: RMSE (log scale) vs. time for d=20 with race-free execution and for "Racing d=20".]

Serializability


For every parallel execution, there exists a sequential execution of update functions which produces the same result.

[Figure: a parallel execution on CPU 1 and CPU 2 is equivalent to some sequential execution of the same update functions on a single CPU.]

Serializability Example

[Figure: under edge consistency, an update function has write access to its vertex and adjacent edges, and read access to adjacent vertices.]

Edge Consistency
Update functions one vertex apart can be run in parallel; overlapping regions are only read.

Stronger / weaker consistency levels are available: user-tunable consistency levels trade off parallelism and consistency.
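The tunable levels can be pictured with a small illustrative enum; the names and the UpdateTask struct are assumptions for the example, not the GraphLab API.

// Illustrative sketch of user-tunable consistency levels and the scope each protects.
enum class ConsistencyLevel {
  kVertex,   // exclusive access to the center vertex only: most parallelism
  kEdge,     // write center vertex + adjacent edges, read adjacent vertices
  kFull      // exclusive access to the entire scope (center + edges + neighbors):
             // least parallelism, strongest guarantee
};

// Conceptually, the engine keeps protected regions from overlapping so that every
// parallel execution is serializable at the chosen level.
struct UpdateTask {
  int vertex_id;
  ConsistencyLevel level = ConsistencyLevel::kEdge;  // assumed default for PageRank-style updates
};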

Distributed Consistency
Solution 1: Graph Coloring
Solution 2: Distributed Locking

Edge Consistency via Graph Coloring

Vertices of the same color are all at least one vertex apart. Therefore, all vertices of the same color can be run in parallel!

Chromatic Distributed Engine

Time →
Execute tasks on all vertices of color 0
Ghost synchronization completion + barrier
Execute tasks on all vertices of color 1
Ghost synchronization completion + barrier
…
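A sketch of the chromatic engine's outer loop, assuming the caller supplies the update function, ghost synchronization, and barrier as callables; this is illustrative, not the GraphLab implementation.

// Vertices of one color never share an edge, so each color block can be updated
// in parallel, with ghost synchronization and a barrier between colors.
#include <functional>
#include <vector>

void run_chromatic_engine(int num_iterations,
                          const std::vector<std::vector<int>>& vertices_by_color,
                          const std::function<void(int)>& run_update,
                          const std::function<void()>& synchronize_ghosts,
                          const std::function<void()>& global_barrier) {
  for (int iter = 0; iter < num_iterations; ++iter) {
    for (const auto& color_block : vertices_by_color) {
      for (int v : color_block)      // same-colored vertices share no edge,
        run_update(v);               // so these updates could run in parallel
      synchronize_ghosts();          // refresh ghost copies on other machines
      global_barrier();              // wait for every machine to finish this color
    }
  }
}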

Matrix Factorization: Netflix Collaborative Filtering

Alternating Least Squares Matrix Factorization
Model: 0.5 million nodes, 99 million edges

[Figure: the Netflix ratings form a bipartite Users–Movies graph; each user and movie is assigned a latent factor vector of dimension d.]
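One ALS vertex update can be sketched as follows, assuming Eigen for the small d × d solve; the function name, interface, and regularization parameter are assumptions, and this is illustrative rather than GraphLab's ALS toolkit.

// For a user vertex, solve a small least-squares problem built from the factors
// of the movies the user rated (and symmetrically for movie vertices).
#include <Eigen/Dense>
#include <vector>

Eigen::VectorXd als_vertex_update(
    const std::vector<Eigen::VectorXd>& neighbor_factors,  // factors of rated movies
    const std::vector<double>& ratings,                     // ratings on adjacent edges
    double lambda,                                          // regularization (assumed)
    int d) {
  Eigen::MatrixXd A = lambda * Eigen::MatrixXd::Identity(d, d);
  Eigen::VectorXd b = Eigen::VectorXd::Zero(d);
  for (std::size_t i = 0; i < neighbor_factors.size(); ++i) {
    A += neighbor_factors[i] * neighbor_factors[i].transpose();  // accumulate X^T X
    b += ratings[i] * neighbor_factors[i];                       // accumulate X^T y
  }
  return A.ldlt().solve(b);  // new latent factor for this vertex
}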

Netflix Collaborative Filtering
[Plot: speedup vs. # machines for D=20 and D=100, compared against ideal scaling.]

The Cost of Hadoop
[Plot: runtime vs. # machines for Hadoop, MPI, and GraphLab.]

CoEM (Rosie Jones, 2005)
Named Entity Recognition Task: is "Cat" an animal? Is "Istanbul" a place?

[Figure: bipartite graph connecting noun phrases ("the cat", "Australia", "Istanbul") to the contexts they occur in ("<X> ran quickly", "travelled to <X>", "<X> is pleasant").]

Vertices: 2 million, Edges: 200 million

Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min
Distributed GraphLab: 32 EC2 nodes, 80 secs (0.3% of the Hadoop time)

Problems
Requires a graph coloring to be available.
Frequent barriers make it extremely inefficient for highly dynamic systems where only a small number of vertices are active in each round.

Distributed Consistency
Solution 1: Graph Coloring
Solution 2: Distributed Locking

Distributed Locking
Edge consistency can be guaranteed through locking: one reader–writer (RW) lock per vertex.

Consistency Through Locking
Acquire a write-lock on the center vertex and read-locks on adjacent vertices.

Multicore setting: PThread RW-locks.
Distributed setting: distributed locks. Challenge: latency.
Solution: pipelining.

[Figure: vertices A, B, C, D partitioned across Machine 1 and Machine 2; locking a scope that spans machines requires remote lock requests.]
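A hedged single-machine sketch of edge-consistency locking with one RW lock per vertex; a distributed implementation must additionally send lock requests to the machines owning remote neighbors. The LockedScope type and its layout are assumptions for illustration.

// Locks are acquired in a canonical (sorted) order to avoid deadlock. Assumes
// the neighbor list excludes the center vertex and contains no duplicates.
#include <algorithm>
#include <shared_mutex>
#include <vector>

struct LockedScope {
  std::vector<std::shared_mutex>& vertex_locks;  // one RW lock per vertex
  int center;
  std::vector<int> neighbors;

  void lock() {
    std::vector<int> order = neighbors;
    order.push_back(center);
    std::sort(order.begin(), order.end());        // canonical order prevents deadlock
    for (int v : order) {
      if (v == center) vertex_locks[v].lock();          // write-lock the center vertex
      else             vertex_locks[v].lock_shared();   // read-lock adjacent vertices
    }
  }
  void unlock() {
    for (int v : neighbors) vertex_locks[v].unlock_shared();
    vertex_locks[center].unlock();
  }
};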

No Pipelining
[Timeline: lock scope 1 → process request 1 → scope 1 acquired → update_function 1 → release scope 1 → process release 1; each lock round trip completes before the next update can start.]

Pipelining / Latency Hiding
Hide latency using pipelining.
[Timeline: lock requests for scopes 1, 2, and 3 are issued back to back; each update function runs as soon as its scope is acquired, so lock latency overlaps with computation.]

Latency Hiding
Hide latency using request buffering.

Residual BP on a 190K-vertex, 560K-edge graph, 4 machines:
No pipelining: 472 s
Pipelining: 10 s
47x speedup
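A minimal sketch of the pipelining idea, assuming the lock service exposes a non-blocking request call and a blocking wait for the next granted scope (both assumptions, not the GraphLab engine API).

// Keep many asynchronous scope-lock requests in flight and run updates as scopes
// are granted, overlapping lock latency with computation.
#include <cstddef>
#include <deque>
#include <functional>

void run_pipelined(std::deque<int> to_process,
                   std::size_t max_in_flight,
                   const std::function<void(int)>& request_scope_async,  // non-blocking
                   const std::function<int()>& wait_for_granted_scope,   // blocks for next grant
                   const std::function<void(int)>& update_and_release) {
  std::size_t in_flight = 0;
  std::size_t remaining = to_process.size();
  while (remaining > 0) {
    // Keep the pipeline full: issue lock requests without waiting for earlier ones.
    while (in_flight < max_in_flight && !to_process.empty()) {
      request_scope_async(to_process.front());
      to_process.pop_front();
      ++in_flight;
    }
    int v = wait_for_granted_scope();   // latency of the other requests overlaps with this wait
    --in_flight;
    update_and_release(v);
    --remaining;
  }
}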

Video Cosegmentation
Segments mean the same.
1740 frames. Model: 10.5 million nodes, 31 million edges.
Probabilistic inference task.

Video Coseg. Speedups
[Plot: speedup vs. # machines for GraphLab, compared against ideal scaling.]

The GraphLab Framework

Graph-Based Data Representation
Update Functions (User Computation)
Consistency Model

What if machines fail? How do we provide fault tolerance?

Checkpoint
1: Stop the world
2: Write state to disk

Snapshot Performance

[Plot: progress over time with no snapshot, with a synchronous snapshot, and with one slow machine; annotations mark the snapshot time and the slow machine.]

Because we have to stop the world, one slow machine slows everything down!

How can we do better? Take advantage of consistency.

Checkpointing
1985: Chandy and Lamport introduced an asynchronous snapshotting algorithm for distributed systems.
[Figure: snapshotted and not-yet-snapshotted vertices coexist while computation continues.]

Fine-grained Chandy-Lamport: easily implemented within GraphLab as an update function!
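A hedged sketch of that snapshot update function; the SnapshotScope interface and the save functions are assumptions for illustration, not GraphLab's toolkit code.

// Run under edge consistency, each vertex saves its own data once, saves edges to
// not-yet-snapshotted neighbors, and schedules those neighbors so the snapshot
// spreads through the graph while regular computation continues.
#include <cstddef>
#include <vector>

struct SnapshotScope {
  int vertex_id;
  bool snapshotted = false;                // has this vertex been saved already?
  std::vector<int> neighbor_ids;
  std::vector<bool> neighbor_snapshotted;  // snapshot flags of adjacent vertices
};

// Provided by the engine/storage layer in a real system (assumed here).
void save_vertex_to_disk(int vertex_id);
void save_edge_to_disk(int vertex_id, int neighbor_id);

std::vector<int> snapshot_update(SnapshotScope& scope) {
  std::vector<int> to_schedule;
  if (scope.snapshotted) return to_schedule;       // already captured, nothing to do
  save_vertex_to_disk(scope.vertex_id);
  scope.snapshotted = true;
  for (std::size_t i = 0; i < scope.neighbor_ids.size(); ++i) {
    if (!scope.neighbor_snapshotted[i]) {
      save_edge_to_disk(scope.vertex_id, scope.neighbor_ids[i]);
      to_schedule.push_back(scope.neighbor_ids[i]); // spread the snapshot to neighbors
    }
  }
  return to_schedule;
}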

Async. Snapshot Performance

[Plot: progress over time with no snapshot, with the asynchronous snapshot, and with one slow machine.]

No penalty incurred by the slow machine!

Summary
Extended the GraphLab abstraction to distributed systems.
Two different methods of achieving consistency:
• Graph coloring
• Distributed locking with pipelining
Efficient implementations; asynchronous fault tolerance with fine-grained Chandy-Lamport.
Performance, usability, efficiency, scalability.

Carnegie Mellon University

Release 2.1 + many toolkits: http://graphlab.org

PageRank: 40x faster than Hadoop

1B edges per second.

Triangle Counting: 282x faster than Hadoop

400M triangles per second.

Major Improvements to be published in OSDI 2012
