
Carnegie Mellon

Parallel Splash Belief Propagation

Joseph E. Gonzalez, Yucheng Low,
Carlos Guestrin, David O’Hallaron

Computers which worked on this project:

BigBro1, BigBro2, BigBro3, BigBro4, BigBro5, BigBro6, BiggerBro, BigBroFS,
Tashi01, Tashi02, Tashi03, Tashi04, Tashi05, Tashi06, …, Tashi30,
parallel, gs6167, koobcam (helped with writing)

2

Why talk about parallelism now?

[Figure: "Change in the Foundation of ML": log CPU speed (GHz) vs. release date (1988-2010), contrasting past sequential performance with future sequential performance and future parallel performance.]

3

Why is this a Problem?

[Figure: methods plotted by sophistication vs. parallelism: Nearest Neighbor [Google et al.], Basic Regression [Cheng et al.], Graphical Models [Mendiburu et al.], Support Vector Machines [Graf et al.]; an arrow marks where we want to be (both sophisticated and parallel).]

4

Why is it hard?

• Algorithmic Efficiency: eliminate wasted computation
• Parallel Efficiency: expose independent computation
• Implementation Efficiency: map computation to real hardware

The Key Insight

5

Statistical Structure
• Graphical Model Structure
• Graphical Model Parameters

Computational Structure
• Chains of Computational Dependences
• Decay of Influence

Parallel Structure
• Parallel Dynamic Scheduling
• State Partitioning for Distributed Computation

6

The Result

[Figure: the same sophistication vs. parallelism chart: Nearest Neighbor [Google et al.], Basic Regression [Cheng et al.], Support Vector Machines [Graf et al.], Graphical Models [Mendiburu et al.], with Splash Belief Propagation (Graphical Models [Gonzalez et al.]) reaching the goal of both sophistication and parallelism.]

7

Outline
• Overview
• Graphical Models: Statistical Structure
• Inference: Computational Structure
• τε-Approximate Messages: Statistical Structure
• Parallel Splash
  - Dynamic Scheduling
  - Partitioning
• Experimental Results
• Conclusions

8

Graphical Models and Parallelism

Graphical models provide a common language for general purpose parallel algorithms in machine learning

A parallel inference algorithm would improve:
• Protein Structure Prediction
• Computer Vision
• Movie Recommendation
Inference is a key step in learning graphical models.

Overview of Graphical Models
Graphical representation of local statistical dependencies

9

[Figure: image denoising model. Observed random variables: the noisy picture. Latent pixel variables: the "true" pixel values. Local dependencies encode continuity assumptions between neighboring pixels.]
Inference: What is the probability that this pixel is black?

Synthetic Noisy Image Problem
• Overlapping Gaussian noise
• Assess convergence and accuracy
[Figure: noisy image and predicted image.]

11

Protein Side-Chain Prediction

Model side-chain interactions as a graphical model

[Figure: protein backbone with attached side-chains; edges connect interacting side-chains.]
Inference: What is the most likely orientation?

12

Protein Side-Chain Prediction

276 Protein Networks, approximately:
• 700 Variables
• 1600 Factors
• 70 Discrete orientations
• Strong Factors
[Figure: protein backbone with side-chains; example degree distribution (fraction of vertices vs. degree).]

13

Markov Logic Networks
Represent logic as a graphical model.
A: Alice, B: Bob
[Figure: factor graph over the True/False variables Smokes(A), Cancer(A), Smokes(B), Cancer(B), and Friends(A,B), with factors such as "Friends(A,B) And Smokes(A) ⇒ Smokes(B)".]
Inference: Pr(Cancer(B) = True | Smokes(A) = True & Friends(A,B) = True) = ?

14

Markov Logic Networks

[Figure: the same factor graph over Smokes(A), Cancer(A), Smokes(B), Cancer(B), and Friends(A,B) (A: Alice, B: Bob, True/False variables).]
UW-Systems Model:
• 8K Binary Variables
• 406K Factors
• Irregular degree distribution: some vertices have high degree

15

Outline
• Overview
• Graphical Models: Statistical Structure
• Inference: Computational Structure
• τε-Approximate Messages: Statistical Structure
• Parallel Splash
  - Dynamic Scheduling
  - Partitioning
• Experimental Results
• Conclusions

16

The Inference Problem

Example queries:
• What is the probability that Bob smokes given Alice smokes?
• What is the best configuration of the protein side-chains?
• What is the probability that each pixel is black?
Inference is NP-Hard in general, so we turn to approximate inference: Belief Propagation.

17

Belief Propagation (BP)
• Iterative message passing algorithm
• Naturally parallel algorithm

18

Parallel Synchronous BP
Given the old messages, all new messages can be computed in parallel:
[Figure: CPUs 1 through n each compute a slice of the new messages from the old messages.]
Map-Reduce Ready!
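For concreteness, a minimal sketch of one synchronous update round on a discrete factor graph (illustrative Python; the dictionary-based representation and names such as `old_msgs` are assumptions, not the authors' code). Because every new message depends only on old messages, the loop bodies can be farmed out to workers or a map step.

```python
import numpy as np

def synchronous_bp_round(variables, factors, old_msgs):
    """One round of synchronous BP. `variables[v]` lists the factors touching v,
    `factors[f]` is a (potential_table, scope) pair, and messages are keyed by
    (sender, receiver). Every new message uses only `old_msgs`."""
    new_msgs = {}
    # Variable -> factor messages: product of the other incoming factor messages.
    for v in variables:
        for f in variables[v]:
            msg = np.ones(2)                      # binary variables, for simplicity
            for g in variables[v]:
                if g != f:
                    msg *= old_msgs[(g, v)]
            new_msgs[(v, f)] = msg / msg.sum()
    # Factor -> variable messages: weight by incoming messages, sum out the others.
    for f in factors:
        table, scope = factors[f]
        for i, v in enumerate(scope):
            msg = table.copy()
            for j, u in enumerate(scope):
                if u != v:
                    shape = [1] * table.ndim
                    shape[j] = -1                 # broadcast u's message along axis j
                    msg = msg * old_msgs[(u, f)].reshape(shape)
                    msg = msg.sum(axis=j, keepdims=True)
            msg = msg.reshape(-1)
            new_msgs[(f, v)] = msg / msg.sum()
    return new_msgs
```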

19

Sequential Computational Structure

20

Hidden Sequential Structure

21

Hidden Sequential Structure

Running Time = (time for a single parallel iteration) × (number of iterations), with evidence at both ends of the chain.
• Optimal sequential algorithm: Forward-Backward
• Naturally parallel schedule: 2n²/p, for p ≤ 2n
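For concreteness, a minimal sketch of the sequential forward-backward schedule on a chain (illustrative Python; the `unary` and `pairwise` potential inputs are assumptions, not names from the original implementation). Each message is sent exactly once, which is why the sequential schedule needs only about 2n updates.

```python
import numpy as np

def forward_backward_chain(unary, pairwise):
    """Exact BP on a chain of n variables. `unary[i]` is the local potential of
    variable i; `pairwise[i]` is the potential matrix between variables i and i+1."""
    n = len(unary)
    fwd = [None] * n                      # fwd[i]: message arriving at i from the left
    bwd = [None] * n                      # bwd[i]: message arriving at i from the right
    fwd[0] = np.ones_like(unary[0])
    for i in range(1, n):                 # forward pass
        m = (unary[i - 1] * fwd[i - 1]) @ pairwise[i - 1]
        fwd[i] = m / m.sum()
    bwd[n - 1] = np.ones_like(unary[-1])
    for i in range(n - 2, -1, -1):        # backward pass
        m = pairwise[i] @ (unary[i + 1] * bwd[i + 1])
        bwd[i] = m / m.sum()
    beliefs = [unary[i] * fwd[i] * bwd[i] for i in range(n)]
    return [b / b.sum() for b in beliefs]
```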

22

Key Computational Structure
[Figure: running time comparison. Naturally parallel schedule: 2n²/p (p ≤ 2n); optimal sequential forward-backward: 2n (p = 1); optimal parallel: n (p = 2). There is a large gap between the naturally parallel schedule and the optimal ones.]

23

[Figure: the same running time comparison, highlighting the gap to the optimal parallel schedule (n at p = 2).]
The gap reflects inherent sequential structure and requires efficient scheduling.

24

Outline
• Overview
• Graphical Models: Statistical Structure
• Inference: Computational Structure
• τε-Approximate Messages: Statistical Structure
• Parallel Splash
  - Dynamic Scheduling
  - Partitioning
• Experimental Results
• Conclusions

25

Parallelism by Approximation

τε represents the minimal sequential structure

[Figure: a chain of 10 vertices comparing the true messages with their τε-approximation.]
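One informal way to read the τε-approximation pictured above (an illustrative restatement, not the paper's formal definition; the norm and notation are assumptions):

```latex
% A message that has incorporated evidence from all vertices within
% \tau_\epsilon hops is already within \epsilon of the true message,
% because influence decays along the chain.
\bigl\lVert m^{\text{true}}_{i \to i+1} \;-\; m^{(\tau_\epsilon)}_{i \to i+1} \bigr\rVert_1 \;\le\; \epsilon
```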

26

Tau-Epsilon Structure
Often τε decreases quickly:
[Figure: message approximation error (log scale) for the Markov Logic Networks and the Protein Networks.]

27

Running Time Lower Bound

Theorem: Using p processors, it is not possible to obtain a τε-approximation in time less than a bound that decomposes into a parallel component and a sequential component.

28

Proof: Running Time Lower Bound
• We must make n − τε vertices τε left-aware.
• Consider one direction of the chain, using p/2 processors (p ≥ 2).
• A single processor can only make k − τε + 1 vertices left-aware in k iterations.
[Figure: a chain of n vertices divided into segments of length τε.]
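Combining these steps gives the shape of the bound (a reconstruction of the omitted algebra, using only the quantities stated above):

```latex
% p/2 processors, k iterations, each making at most k - \tau_\epsilon + 1 vertices
% left-aware, while n - \tau_\epsilon vertices must be made \tau_\epsilon left-aware:
\frac{p}{2}\,(k - \tau_\epsilon + 1) \;\ge\; n - \tau_\epsilon
\quad\Longrightarrow\quad
k \;\ge\; \frac{2\,(n - \tau_\epsilon)}{p} + \tau_\epsilon - 1,
```

so the running time has a parallel component that scales as n/p and a sequential component that scales as τε.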

29

Optimal Parallel Scheduling
[Figure: the chain is cut into p contiguous blocks, one per processor (Processor 1, Processor 2, Processor 3), and each processor runs the forward-backward schedule on its own block.]
Theorem: Using p processors, this algorithm achieves a τε-approximation in time with the same parallel-plus-sequential form; the proof below works out the bound.

30

Proof: Optimal Parallel Scheduling

• All vertices are left-aware of the leftmost vertex on their processor.
• After exchanging boundary messages and running the next iteration, left-awareness extends across processor boundaries.
• After k parallel iterations, each vertex is (k−1)(n/p) left-aware.

31

Proof: Optimal Parallel Scheduling

• After k parallel iterations, each vertex is (k−1)(n/p) left-aware.
• Since all vertices must be made τε left-aware, and each iteration takes O(n/p) time, the total running time follows.
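Filling in the algebra implied by these two steps (a reconstruction using only the quantities stated above):

```latex
% Each vertex is (k-1)\,n/p left-aware after k iterations; we need \tau_\epsilon:
(k - 1)\,\frac{n}{p} \;\ge\; \tau_\epsilon
\quad\Longrightarrow\quad
k \;\ge\; \frac{p\,\tau_\epsilon}{n} + 1,
% and each iteration costs O(n/p) per processor, so the total time is
O\!\left(\frac{n}{p}\cdot k\right) \;=\; O\!\left(\frac{n}{p} + \tau_\epsilon\right).
```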

32

Comparing with Synchronous BP
[Figure: synchronous schedule vs. optimal schedule across Processors 1-3, showing the gap between them.]

33

Outline
• Overview
• Graphical Models: Statistical Structure
• Inference: Computational Structure
• τε-Approximate Messages: Statistical Structure
• Parallel Splash
  - Dynamic Scheduling
  - Partitioning
• Experimental Results
• Conclusions

34

The Splash Operation
Generalize the optimal chain algorithm to arbitrary cyclic graphs:
1) Grow a BFS spanning tree with fixed size
2) Forward pass computing all messages at each vertex
3) Backward pass computing all messages at each vertex
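A minimal sketch of these three steps (illustrative Python; `graph`, `send_messages`, and the exact sweep order are assumptions about representation, not the authors' implementation):

```python
from collections import deque

def splash(root, graph, splash_size, send_messages):
    """One Splash: grow a bounded BFS tree rooted at `root`, then sweep it once
    in each direction, recomputing all messages at each visited vertex.
    `graph[v]` lists the neighbors of v; `send_messages(v)` recomputes and sends
    all outbound messages of v (illustrative callback)."""
    # 1) Grow a BFS spanning tree of bounded size.
    order, visited, frontier = [], {root}, deque([root])
    while frontier and len(order) < splash_size:
        v = frontier.popleft()
        order.append(v)
        for u in graph[v]:
            if u not in visited:
                visited.add(u)
                frontier.append(u)
    # 2) Forward pass: from the leaves of the tree toward the root.
    for v in reversed(order):
        send_messages(v)
    # 3) Backward pass: from the root back out toward the leaves.
    for v in order:
        send_messages(v)
```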

35

Running Parallel Splashes
[Figure: CPUs 1-3, each with its own local state and a running Splash.]
• Partition the graph
• Schedule Splashes locally
• Transmit the messages along the boundary of the partition
Key Challenges:
1) How do we schedule Splashes?
2) How do we partition the graph?

Where do we Splash?
Assign priorities and use a scheduling queue to select roots:
[Figure: CPU 1 with its local state and a scheduling queue feeding the roots of successive Splashes.]
How do we assign priorities?

37

Message Scheduling
Residual Belief Propagation [Elidan et al., UAI 06]: assign priorities based on the change in inbound messages.
[Figure: two vertices, one whose inbound messages changed little and one whose inbound messages changed a lot.]
• Small change: expensive no-op
• Large change: informative update

38

Problem with Message Scheduling

Small changes in messages do not imply small changes in belief:
[Figure: a small change in all inbound messages can still produce a large change in the belief.]

39

Problem with Message Scheduling

Large changes in a single message do not imply large changes in belief:
[Figure: a large change in a single inbound message can still produce only a small change in the belief.]

40

Belief Residual Scheduling
Assign priorities based on the cumulative change in belief: the residual rv accumulates the change in belief caused by each inbound message update.
A vertex whose belief has changed substantially since last being updated will likely produce informative new messages.
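A sketch of how such a belief-residual priority queue might be maintained (illustrative Python; the data structures and the L1 measure of belief change are assumptions, not the authors' implementation):

```python
import heapq
import numpy as np

class BeliefResidualScheduler:
    """Priority queue keyed by the accumulated change in belief. Each time an
    inbound message to v changes, the induced change in v's belief is added to
    v's residual; selecting v as a Splash root resets its residual."""

    def __init__(self):
        self.residual = {}          # vertex -> accumulated belief change
        self.heap = []              # max-heap via negated priorities (lazy deletion)

    def on_message_update(self, v, old_belief, new_belief):
        change = float(np.abs(new_belief - old_belief).sum())   # L1 change in belief
        self.residual[v] = self.residual.get(v, 0.0) + change
        heapq.heappush(self.heap, (-self.residual[v], v))

    def pop_root(self):
        # Skip stale heap entries until we find one whose priority is current.
        while self.heap:
            neg_prio, v = heapq.heappop(self.heap)
            if -neg_prio == self.residual.get(v, 0.0) and self.residual[v] > 0.0:
                self.residual[v] = 0.0          # the vertex is about to be updated
                return v
        return None
```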

41

Message vs. Belief Scheduling
Belief scheduling improves accuracy and convergence.
[Figure: (left) percentage of runs converged within 4 hours, belief residuals vs. message residuals; (right) L1 error in beliefs over time (seconds) for message scheduling vs. belief scheduling.]

Splash Pruning
Belief residuals can be used to dynamically reshape and resize Splashes:
[Figure: a Splash tree that avoids regions with low belief residuals.]
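One way to fold pruning into the tree-growth step of the Splash sketch above (illustrative Python; the `residual` map and `threshold` are assumed inputs, not the authors' parameters):

```python
from collections import deque

def grow_pruned_splash_tree(root, graph, splash_size, residual, threshold):
    """Bounded BFS that skips vertices whose belief residual is below the
    threshold, so the Splash reshapes itself around the informative region."""
    order, visited, frontier = [], {root}, deque([root])
    while frontier and len(order) < splash_size:
        v = frontier.popleft()
        order.append(v)
        for u in graph[v]:
            # Prune: only expand neighbors whose beliefs are still changing.
            if u not in visited and residual.get(u, 0.0) >= threshold:
                visited.add(u)
                frontier.append(u)
    return order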

43

Splash Size
Using Splash Pruning, the algorithm is able to dynamically select the optimal Splash size.
[Figure: running time (seconds) vs. Splash size (messages), with and without pruning.]

44

Example

[Figure: the synthetic noisy image, its factor graph, and the vertex update counts (regions with many updates vs. few updates).]
The algorithm identifies and focuses on the hidden sequential structure.

45

Parallel Splash Algorithm

• Partition the factor graph over processors
• Schedule Splashes locally using belief residuals
• Transmit messages on the boundary
[Figure: CPUs 1-3, each with local state and a scheduling queue, running Splashes and exchanging boundary messages over a fast, reliable network.]
Theorem: Given a uniform partitioning of the chain graphical model, Parallel Splash runs in time matching the optimal bound derived earlier, retaining optimality.

46

Partitioning Objective
The partitioning of the factor graph determines storage, computation, and communication.
Goal: balance computation and minimize communication cost.
[Figure: a factor graph split between CPU 1 and CPU 2, with cut edges incurring communication cost.]

47

The Partitioning Problem
• Objective: minimize communication while ensuring balanced work.
• The work and communication terms depend on the per-vertex update counts.
• The problem is NP-Hard; METIS provides a fast partitioning heuristic.
• But the update counts are not known in advance!
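One standard way to write the objective the slide gestures at (an illustrative formalization, not the slide's own formula; u_v denotes the unknown update count of vertex v, w_uv an edge communication weight, B_1,…,B_p the blocks, and γ a balance slack):

```latex
\min_{B_1,\dots,B_p}\;
\underbrace{\sum_{(u,v)\in E:\; u\in B_i,\ v\in B_j,\ i\neq j} w_{uv}}_{\text{communication}}
\qquad \text{s.t.} \qquad
\underbrace{\sum_{v \in B_i} u_v \;\le\; \frac{\gamma}{p}\sum_{v} u_v}_{\text{work balance}}
\quad \forall\, i.
```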

48

Unknown Update Counts
• Determined by belief scheduling
• Depend on graph structure, factors, …
• Little correlation between past and future update counts
[Figure: update counts on the noisy-image model.]
Simple solution: an uninformed cut.

Uninformed Cuts
Uninformed cuts give greater work imbalance but lower communication cost.
[Figure: update counts, the uninformed cut, and the optimal cut on the noisy-image model; some partitions get too much work, others too little.]
[Figure: work imbalance and communication cost of uninformed vs. optimal cuts on the Denoise and UW-Systems models.]

50

Over-Partitioning
Over-cut the graph into k·p partitions and randomly assign them to CPUs:
• Increases balance
• Increases communication cost (more boundary)
[Figure: without over-partitioning, each CPU owns one contiguous half; with k = 6, each CPU owns several scattered pieces.]
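A sketch of the over-partition-and-randomly-assign step (illustrative Python; `partition_graph` is a placeholder for a real partitioner such as a METIS binding, not part of the original code):

```python
import random

def over_partition(graph, p, k, partition_graph):
    """Cut the graph into k*p pieces and randomly deal the pieces out to the
    p CPUs; finer pieces average out the unknown per-vertex work.
    `partition_graph(graph, nparts)` should return a block id per vertex."""
    blocks = partition_graph(graph, k * p)          # vertex -> block id in [0, k*p)
    block_ids = list(range(k * p))
    random.shuffle(block_ids)
    # Round-robin over the shuffled blocks: each CPU receives exactly k blocks.
    owner_of_block = {b: i % p for i, b in enumerate(block_ids)}
    return {v: owner_of_block[blocks[v]] for v in graph}   # vertex -> CPU id
```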

51

Over-Partitioning Results
Over-partitioning provides a simple way to trade between work balance and communication cost.
[Figure: communication cost and work imbalance as functions of the partition factor k.]

52

CPU Utilization
Over-partitioning improves CPU utilization:
[Figure: number of active CPUs over time (seconds) on the UW-Systems MLN and the Denoise model, with no over-partitioning vs. 10x over-partitioning.]

53

Parallel Splash Algorithm

• Over-partition the factor graph
• Randomly assign pieces to processors
• Schedule Splashes locally using belief residuals
• Transmit messages on the boundary
[Figure: CPUs 1-3, each with local state and a scheduling queue, running Splashes and exchanging boundary messages over a fast, reliable network.]

54

Outline
• Overview
• Graphical Models: Statistical Structure
• Inference: Computational Structure
• τε-Approximate Messages: Statistical Structure
• Parallel Splash
  - Dynamic Scheduling
  - Partitioning
• Experimental Results
• Conclusions

55

Experiments
• Implemented in C++ using MPICH2 as the message-passing API
• Ran on an Intel OpenCirrus cluster: 120 processors (15 nodes, each with 2 x quad-core Intel Xeon processors, connected by a Gigabit Ethernet switch)
• Tested on Markov Logic Networks obtained from Alchemy [Domingos et al., SSPR 08]
• Results presented for the largest (UW-Systems) and smallest (UW-Languages) MLNs

56

Parallel Performance (Large Graph)

UW-Systems: 8K variables, 406K factors. Single-processor running time: 1 hour.
Linear to super-linear speedup up to 120 CPUs; the super-linearity comes from cache efficiency.
[Figure: speedup vs. number of CPUs for no over-partitioning and 5x over-partitioning, compared against linear speedup.]

57

Parallel Performance (Small Graph)

UW-Languages: 1K variables, 27K factors. Single-processor running time: 1.5 minutes.
Linear to super-linear speedup up to 30 CPUs; beyond that, network costs quickly dominate the short running time.
[Figure: speedup vs. number of CPUs for no over-partitioning and 5x over-partitioning, compared against linear speedup.]

58

Outline
• Overview
• Graphical Models: Statistical Structure
• Inference: Computational Structure
• τε-Approximate Messages: Statistical Structure
• Parallel Splash
  - Dynamic Scheduling
  - Partitioning
• Experimental Results
• Conclusions

59

Summary
• Algorithmic efficiency: Splash structure + belief residual scheduling
• Parallel efficiency: independent parallel Splashes
• Implementation efficiency: distributed queues, asynchronous communication, over-partitioning
Experimental results on large factor graphs: linear to super-linear speedup using up to 120 processors.
[Figure: the sophistication vs. parallelism chart; Parallel Splash Belief Propagation moves graphical models from "we were here" to "we are here".]

Conclusion

61

Questions

62

Protein Results

[Figure: relative speedup vs. number of cores (1-8) for Residual Splash BP, Residual BP, and MapReduce BP, compared against linear speedup.]

63

3D Video Task

[Figure: relative speedup vs. number of cores (1-8) for Residual Splash BP, Residual BP, and MapReduce BP, compared against linear speedup.]

64

Distributed Parallel Setting

Opportunities:
• Access to larger systems: 8 CPUs → 1000 CPUs
• Linear increase in RAM, cache capacity, and memory bandwidth
Challenges:
• Distributed state, communication, and load balancing
[Figure: compute nodes (CPU, bus, cache, memory) connected by a fast, reliable network.]
